Chapter 1 Defining and Collecting Data


1.3 Collecting Data

After defining the variables that you want to study, you can proceed with the data collection task. Collecting data is a critical task because if you collect data that are flawed by biases, ambiguities, or other types of errors, the results you get from using such data, even with the most sophisticated statistical methods, will be suspect or in error. Data collection consists of identifying data sources, deciding whether the data you collect will be from a population or a sample, cleaning your data, and sometimes recoding variables.

Data Sources

You collect data from either primary or secondary data sources. You are using a primary data source if you collect your own data for analysis. You are using a secondary data source if the data for your analysis have been collected by someone else. You collect data by using any of the following:
• Data distributed by an organization or individual
• The outcomes of a designed experiment
• The responses from a survey
• The results of conducting an observational study
• Data collected by ongoing business activities

• Primary Sources: The data collector is the one using the data for analysis
  ◦ Data from a political survey
  ◦ Data collected from an experiment
  ◦ Observed data
• Secondary Sources: The person performing data analysis is not the data collector
  ◦ Analyzing census data
  ◦ Examining data from print journals or data published on the internet

Examples of Data Distributed by Organizations or Individuals
• Financial data on a company provided by investment services.
• Industry or market data from market research firms and trade associations.
• Stock prices, weather conditions, and sports statistics in daily newspapers.

Examples of Data From a Designed Experiment
• Consumer testing of different versions of a product to help determine which product should be pursued further.
• Material testing to determine which supplier's material should be used in a product.
• Market testing on alternative product promotions to determine which promotion to use more broadly.

Examples of Survey Data
• A survey asking people which laundry detergent has the best stain-removing abilities.
• Political polls of registered voters during political campaigns.
• People being surveyed to determine their satisfaction with a recent product or service experience.

Examples of Data Collected From Observational Studies
• Market researchers utilizing focus groups to elicit unstructured responses to open-ended questions.
• Measuring the time it takes for customers to be served in a fast food establishment.
• Measuring the volume of traffic through an intersection to determine if some form of advertising at the intersection is justified.

Examples of Data Collected From Ongoing Business Activities
• A bank studies years of financial transactions to help identify patterns of fraud.
• Economists utilize data on searches done via Google to help forecast future economic conditions.
• Marketing companies use tracking data to evaluate the effectiveness of a web site.

Populations and Samples

You collect data from either a population or a sample.

POPULATION: A population consists of all the items or individuals about which you want to reach conclusions. The population is the large group. When you analyze data from a population, you compute parameters.

SAMPLE: A sample is the portion of a population selected for analysis. The results of analyzing a sample are used to estimate characteristics of the entire population. The sample is the small group. When you analyze data from a sample, you compute statistics.

Population vs. Sample

Population: All the items or individuals about which you want to draw conclusion(s)
Sample: A portion of the population of items or individuals

Data collection will involve collecting data from a sample when any of the following conditions hold:
• Selecting a sample is less time consuming than selecting every item in the population.
• Selecting a sample is less costly than selecting every item in the population.
• Analyzing a sample is less cumbersome and more practical than analyzing the entire population.
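
The difference between a parameter (computed from a population) and a statistic (computed from a sample) can be illustrated with a minimal Python sketch. The population below is synthetic and generated only for illustration; the names `population` and `sample` and the sizes used are assumptions, not data from the text.

```python
import random
import statistics

# Hypothetical population: 25,000 synthetic values (illustration only, not real data).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(100, 15) for _ in range(25_000)]

# A parameter is computed from every item in the population.
population_mean = statistics.mean(population)

# A statistic is computed from a sample: here, 250 items drawn without replacement.
sample = rng.sample(population, 250)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean is only an estimate of the population mean, but it is obtained from 1% of the items, which is why sampling is usually the less costly and less time-consuming choice.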

Data Formatting

The data you collect may be formatted in more than one way. For example, suppose that you wanted to collect electronic financial data about a sample of companies. The data you seek to collect could be formatted in any number of ways, including the following:
• Tables of data
• Contents of standard forms
• A continuous data stream, such as a stock ticker
• Messages delivered from social media websites and networks

These examples illustrate that data can exist in either a structured or an unstructured form. Structured data is data that follows some organizing principle or plan, typically a repeating pattern. For example, a simple stock ticker is structured because each entry would have the name of a company, the number of shares last traded, the bid price, and the percent change in the stock price that the transaction represents. In a table, each row contains a set of values for the same columns (i.e., variables), and in a set of forms, each form contains the same set of entries. For example, once we identify that the second column of a table or the second entry on a form contains the family name of an individual, then we know that all entries in the second column of the table or all of the second entries in all copies of the form contain the family name of an individual.

In contrast, unstructured data is data that follows no repeating pattern. For example, if five different persons sent you an email message concerning the stock trades of a specific company, that data could be anywhere in the message. You could not reliably count on the name of the company being the first words of each message, and the pricing, volume, and percent change data could appear in any order.
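
The contrast between structured and unstructured data can be sketched in Python as follows. The comma-separated ticker layout, the company symbols, and the message wording are all hypothetical and chosen only for illustration.

```python
# Structured data: every entry has the same fields in the same order.
# The layout (company, shares, bid price, percent change) is an assumed format.
ticker_lines = [
    "ABC,500,10.25,0.5",
    "XYZ,1200,48.10,-1.2",
]

def parse_ticker(line):
    """Split one ticker entry into named fields by position."""
    company, shares, bid_price, pct_change = line.split(",")
    return {
        "company": company,
        "shares": int(shares),
        "bid_price": float(bid_price),
        "pct_change": float(pct_change),
    }

records = [parse_ticker(line) for line in ticker_lines]

# Unstructured data: the same facts, but with no fixed position in the text.
messages = [
    "Heard ABC traded 500 shares at 10.25 today, up 0.5%.",
    "Up 0.5% on the day: 500 shares of ABC went at 10.25.",
]

# Reading "the first word" as the company name fails for both messages.
first_words = [message.split()[0] for message in messages]

print(records)
print(first_words)
```

Positional parsing works for the ticker entries because the pattern repeats, while the first word of each message is not the company name in either case.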

Data Can Be Formatted and/or Encoded in More Than One Way
• Some electronic formats are more readily usable than others.
• Different encodings can impact the precision of numerical variables and can also impact data compatibility.
• As you identify and choose sources of data, you need to consider and deal with these issues.

Data Cleaning

Whatever ways you choose to collect data, you may find irregularities in the values you collect such as undefined or impossible values. For a categorical variable, an undefined value would be a value that does not represent one of the categories defined for the variable. For a numerical variable, an impossible value would be a value that falls outside a defined range of possible values for the variable. For a numerical variable without a defined range of possible values, you might also find outliers, values that seem excessively different from most of the rest of the values. Such values may or may not be errors, but they demand a second review.

Values that are missing are another type of irregularity. A missing value is a value that was not able to be collected (and is therefore not available for analysis). For example, you would record a nonresponse to a survey question as a missing value. When you spot an irregularity, you may have to clean the data you have collected.

Data Cleaning Is Often a Necessary Activity When Collecting Data
• You will often find irregularities in the data:
  ◦ typographical or data entry errors
  ◦ values that are impossible or undefined
  ◦ missing values
  ◦ outliers
• When found, these irregularities should be reviewed and addressed.
• Both Excel and Minitab can be used to address irregularities.
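
A first pass at spotting undefined, impossible, and missing values might look like the Python sketch below. The records, the `payment` categories, and the age range of 0 to 120 are hypothetical assumptions chosen for illustration. The function only flags values for review and does not decide how to fix them.

```python
# Hypothetical survey records; None marks a value that could not be collected.
records = [
    {"id": 1, "payment": "card", "age": 34},
    {"id": 2, "payment": "cash", "age": 212},    # impossible age
    {"id": 3, "payment": "chekc", "age": 45},    # undefined category (typo)
    {"id": 4, "payment": "card", "age": None},   # missing value
]

VALID_PAYMENTS = {"cash", "card", "check"}  # assumed category definitions
AGE_RANGE = (0, 120)                        # assumed range of possible values

def find_irregularities(rows):
    """Return (id, field, problem) tuples for values that need a second review."""
    issues = []
    for row in rows:
        payment, age = row["payment"], row["age"]
        if payment is None:
            issues.append((row["id"], "payment", "missing"))
        elif payment not in VALID_PAYMENTS:
            issues.append((row["id"], "payment", "undefined"))
        if age is None:
            issues.append((row["id"], "age", "missing"))
        elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
            issues.append((row["id"], "age", "impossible"))
    return issues

for issue in find_irregularities(records):
    print(issue)
```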

Recoding Variables

After you have collected data, you may discover that you need to reconsider the categories that you have defined for a categorical variable or that you need to transform a numerical variable into a categorical variable by assigning the individual numeric data values to one of several groups. In either case, you can define a recoded variable that supplements or replaces the original variable in your analysis.

When recoding variables, be sure that the category definitions cause each data value to be placed in one and only one category, a property known as being mutually exclusive. Also ensure that the set of categories you create for the new, recoded variable includes all the data values being recoded, a property known as being collectively exhaustive. If you are recoding a categorical variable, you can preserve one or more of the original categories, as long as your recodings are both mutually exclusive and collectively exhaustive.

When recoding numerical variables, pay particular attention to the operational definitions of the categories you create for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories Under 12, 12-20, 21-34, 35-54, and 55 and Over are self-defining for age, the categories Child, Youth, Young Adult, Middle Aged, and Senior need their own operational definitions.

After Collection It Is Often Helpful to Recode Some Variables
• Recoding a variable can either supplement or replace the original variable.
• Recoding a categorical variable involves redefining categories.
• Recoding a quantitative variable involves changing this variable into a categorical variable.
• When recoding, be sure that the new categories are mutually exclusive (categories do not overlap) and collectively exhaustive (categories cover all possible values).
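
As a minimal sketch, the age categories Under 12, 12-20, 21-34, 35-54, and 55 and Over can be applied in Python as shown below. The function name `recode_age` and the sample ages are hypothetical, and ages are assumed to be recorded in whole years.

```python
def recode_age(age):
    """Assign an age in whole years to exactly one category."""
    if age < 0:
        raise ValueError("age cannot be negative")
    # Each branch is reached only if the earlier ones did not match,
    # so every age lands in one and only one category (mutually exclusive).
    if age < 12:
        return "Under 12"
    if age <= 20:
        return "12-20"
    if age <= 34:
        return "21-34"
    if age <= 54:
        return "35-54"
    # The final branch catches everything else (collectively exhaustive).
    return "55 and Over"

ages = [5, 12, 20, 21, 34, 35, 54, 55, 80]  # hypothetical values
recoded = [recode_age(age) for age in ages]
print(list(zip(ages, recoded)))
```

Checking the boundary values (12, 20, 21, and so on) is a quick way to confirm that no age falls into two categories or into none.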

Choose the correct answer

1. The process of using data collected from a small group to reach conclusions about a large group is called
a) statistical inference.
b) DCOVA framework.
c) operational definition.
d) descriptive statistics.
ANSWER: a

2. Those methods involving the collection, presentation, and characterization of a set of data in order to properly describe the various features of that set of data are called
a) statistical inference.
b) DCOVA framework.
c) operational definition.
d) descriptive statistics.
ANSWER: d

3. The collection and summarization of the socioeconomic and physical characteristics of the employees of a particular firm is an example of
a) inferential statistics.
b) descriptive statistics.
c) operational definition.
d) DCOVA framework.
ANSWER: b

4. The estimation of the population average family expenditure on food based on the sample average expenditure of 1,000 families is an example of
a) inferential statistics.
b) descriptive statistics.
c) DCOVA framework.
d) operational definition.
ANSWER: a

5. Which of the following is not an element of descriptive statistical problems?
a) An inference made about the population based on the sample.
b) The population or sample of interest.
c) Tables, graphs, or numerical summary tools.
d) Identification of patterns in the data.
ANSWER: a

6. A study is under way in Yosemite National Forest to determine the adult height of American pine trees. Specifically, the study is attempting to determine what factors aid a tree in reaching heights greater than 60 feet tall. It is estimated that the forest contains 25,000 adult American pines. The study involves collecting heights from 250 randomly selected adult American pine trees and analyzing the results. Identify the variable of interest in the study.
a) The age of an American pine tree in Yosemite National Forest.
b) The height of an American pine tree in Yosemite National Forest.
c) The number of American pine trees in Yosemite National Forest.
d) The species of trees in Yosemite National Forest.
ANSWER: b

7. Most analysts focus on the cost of tuition as the way to measure the cost of a college education. But incidentals, such as textbook costs, are rarely considered. A researcher at Drummand University wishes to estimate the textbook costs of first-year students at Drummand. To do so, she monitored the textbook cost of 250 first-year students and found that their average textbook cost was $600 per semester. Identify the variable of interest to the researcher.
a) The textbook cost of first-year Drummand University students.
b) The year in school of Drummand University students.
c) The age of Drummand University students.
d) The cost of incidental expenses of Drummand University students.
ANSWER: a

8. Which of the following is not true about business analytics?
a) It enables you to use statistical methods to analyze and explore data to uncover unforeseen relationships.
b) It enables you to use management science methods to develop optimization models that impact an organization's strategy, planning, and operations.
c) It enables you to use complex mathematics to replace the need for organizational decision making and problem solving.
d) It enables you to use information systems methods to collect and process data sets of all sizes.
ANSWER: c

True or false

The V in the DCOVA framework stands for analyze.
ANSWER: False (the V stands for visualize; the A stands for analyze)

DCOVA framework stands for
• Define the data that you want to study in order to solve a problem or meet an objective.
• Collect the data from appropriate sources.
• Organize the data collected by developing tables.
• Visualize the data collected by developing charts.
• Analyze the data collected to reach conclusions and present those results.

Problems for Section 1.3

1.12 Assume that research has been carried out to estimate the rate of return given by all the Initial Public Offerings (IPOs) in the U.K. if they are sold on the first day of listing. The researcher analyzed the returns given by 250 IPOs in the U.K. Categorize the data as population and sample.

Population: Return on all the IPOs in the U.K.
Sample: Return on 250 IPOs in the U.K.

1.13 With reference to the case in 1.12, explain why the researcher chose to collect the returns for 250 IPOs, rather than considering all IPOs.

ANSWER: Selecting a sample is less time consuming, less cumbersome, and less costly than selecting every item in the population. Also, the sample results can be used to estimate population results.

1.14 Assume that the recorded heights of 10 students are 120, 122, 128, 176, 124, 127, 121, 125, 127, and 129 centimeters. Which number do you think will be the outlier while calculating the average height of students in the class, and why? How would you deal with this outlier?

ANSWER: According to the given data, the heights of the other students range from 120 to 130 centimeters. Thus, including 176 can lead to misleading results, and it ought to be classified as an outlier. Such values may or may not be errors, but they demand a second review.
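
The effect of the value 176 on the average in problem 1.14 can be checked with a short Python sketch. The 1.5 × IQR fence used here to flag the value is one common rule of thumb and is an assumption of this sketch; the text itself does not prescribe a detection rule.

```python
import statistics

# Heights in centimeters, as given in problem 1.14.
heights = [120, 122, 128, 176, 124, 127, 121, 125, 127, 129]

# Quartiles and the 1.5 * IQR fences (an assumed rule of thumb, not from the text).
q1, _median, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences for a second review; do not delete them blindly.
flagged = [h for h in heights if h < low_fence or h > high_fence]
kept = [h for h in heights if low_fence <= h <= high_fence]

mean_all = statistics.mean(heights)
mean_without = statistics.mean(kept)

print("flagged for review:", flagged)
print(f"mean with all values:    {mean_all:.1f}")
print(f"mean without flagged:    {mean_without:.1f}")
```

Only 176 is flagged, and the mean drops from 129.9 to about 124.8 centimeters when it is set aside, which shows how strongly a single extreme value can pull the average.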

1.15 Transportation engineers and planners want to address the dynamic properties of travel behavior by describing in detail the driving characteristics of drivers over the course of a month. What type of data collection source do you think the transportation engineers and planners should use?

ANSWER: The transportation engineers and planners should use primary data collected through an observational study of the driving characteristics of drivers over the course of a month.

1.16 Visit the website of NASDAQ. Enter the symbols of 2-3 companies one by one. Observe the format in which the results appear for each of these companies. In which format do you think the data appears?

ANSWER: Since all the data appears in a similar format for all companies, it will be categorized as structured data.
