CSC-272 Exam #1
February 13, 2015
Name ______________

Questions are weighted as indicated. Show your work and state your assumptions for partial credit consideration. Unless explicitly stated, there are NO intended errors and NO trick questions. If in doubt, ask! You have 50 minutes to work. Now, take a moment to relax. If you don't immediately see how to do something, THINK! Don't panic!

Multiple Choice (2 points each):

1. Data used to build a data mining model is called
   a. validation data
   b. training data
   c. test data
   d. hidden data

2. Supervised learning differs from unsupervised learning in that supervised learning requires
   a. at least one input attribute
   b. input attributes to be nominal
   c. at least one output attribute
   d. output attributes to be nominal

3. A nearest neighbor approach is best used
   a. with large-sized datasets
   b. when irrelevant attributes have been removed from the data
   c. when a generalized model of the data is desirable
   d. when an explanation of what has been found is of primary importance

4. Classification problems are distinguished from estimation problems in that
   a. classification problems require the output attribute to be numeric
   b. classification problems require the output attribute to be nominal
   c. classification problems do not allow an output attribute
   d. classification problems are designed to predict future outcomes
5. Which statement is true about prediction problems?
   a. The output attribute must be nominal
   b. The output attribute must be numeric
   c. The resultant model is designed to determine future outcomes
   d. The resultant model is designed to classify current behavior

6. Unlike traditional classification rules, association rules
   a. allow the same attribute to be an input attribute in one rule and an output attribute in another rule
   b. allow more than one input attribute in a single rule
   c. require input attributes to take on numeric values
   d. require each rule to have exactly one categorical output attribute

7. Association rule support is defined as
   a. the percentage of instances that contain the antecedent conditional items listed in the association rule
   b. the percentage of instances that contain the consequent conditions listed in the association rule
   c. the percentage of instances that contain all items listed in the association rule
   d. the percentage of instances that contain at least one of the antecedent conditional items listed in the association rule

8. Eric Siegel used predictive analytics to choose which ad to display based on:
   a. the revenue the ad generated on average
   b. the click-throughs the ad generated on average
   c. the percent likelihood that a specific user would click on the ad
   d. the percent likelihood that anyone would click on the ad
Fill in the Blank (2 points each blank): Use the following list of terms to fill in the blanks with the best possible term. More than one answer might be justifiable. Resulting sentences are not necessarily grammatically correct.

big data, mixed attributes, ARFF, model, dataset, sparse data, database query, decision tree, outliers, class attribute, linear regression formula, visualization, classification, overfitting, instance-based, machine learning, data, decision boundary, rules, information, nodes, market basket analysis, instance, leaves, supervised learning, association, regression tree, unsupervised learning, numeric estimation, antecedent, clustering, test data, consequent, classification rule, flat file, decision list, association rule, denormalization, support, nominal, data warehouse, confidence, ordinal, overlay, nearest neighbor

__________ is a row in a dataset.

__________ is employed to help evaluate the quality of a model.

2D and 3D graphs are a __________ tool that can be used to detect __________ in a dataset.

__________ results when a majority of attributes in a large dataset have no value.

__________ describes the values hot > mild > cold.

__________ is supplemental data sometimes required to build an effective dataset.

__________ results from giving too much emphasis to non-predictive attributes of the dataset.

__________ of a rule is determined by dividing the rule's support by the number of instances the rule's __________ applies to.
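The support and confidence definitions referenced above can be illustrated with a few lines of code. This is a hedged sketch on a tiny made-up transaction list (the items and transactions are invented for illustration, not data from the exam):

```python
# Tiny invented market-basket dataset: each set is one instance (transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    # Support: fraction of instances that contain ALL the listed items.
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Confidence: the rule's support divided by the fraction of
    # instances the rule's antecedent applies to.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 = 0.666...
```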
Short Answer (6 points each): Give (relatively) short answers to the following questions. You must omit any one question by writing OMIT clearly in the space provided.

1. Consider the following decision tree:

   Business Appointment?
     Yes: Decision = wear slacks
     No:  Temp above 70?
            Yes: Decision = wear shorts
            No:  Decision = wear jeans

   a. This is the output of what type of data mining algorithm?
   b. How would the new instance Raining = Yes; Business Appointment = No; Temp above 70? = Yes be classified by the tree?

2. a. Suppose a dataset with 100 instances was used to create the tree from problem 1 using the J48 algorithm. Why wouldn't it be a good idea to use these 100 instances as the test data to evaluate the model?
   b. Briefly describe either the ten-fold cross-validation strategy or the percentage split strategy for evaluating such a decision tree.
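The ten-fold cross-validation strategy named in question 2b can be sketched in a few lines. This is a minimal illustration, not Weka's J48 API; `build_model` and `evaluate` are hypothetical stand-ins for any train/score pair of functions:

```python
def ten_fold_cross_validation(instances, build_model, evaluate):
    # Split the data into 10 folds; each fold takes one turn as test data
    # while the other 9 folds serve as training data.
    n = len(instances)
    scores = []
    for k in range(10):
        test = instances[k * n // 10 : (k + 1) * n // 10]
        train = instances[: k * n // 10] + instances[(k + 1) * n // 10 :]
        model = build_model(train)          # train on 9/10 of the data
        scores.append(evaluate(model, test))  # score on the held-out fold
    return sum(scores) / 10  # average score over the ten folds

# Demo with dummy stand-ins: 100 instances, each fold holds out exactly 10.
avg = ten_fold_cross_validation(
    list(range(100)),
    lambda train: None,                 # "model" is ignored in this demo
    lambda model, test: len(test) / 10,  # dummy score: 1.0 per fold
)
print(avg)  # 1.0
```

The key property this shows is that every instance is used for testing exactly once, and no instance is ever in the training and test set of the same fold at the same time.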
3. Consider each of the following data mining scenarios. For each one, state whether the most appropriate technique to use is classification learning, numeric estimation, or association learning.
   a. Determine a freshman's likely first-year grade point average from the student's combined SAT score, high school class standing, and total number of high school science and math credits.
   b. Develop a model to determine if an individual is a good candidate for a home mortgage loan.
   c. Determine what factors among age, gender, income, type of vehicle, and state of residence are most indicative of individuals who have received three or more traffic violations in the past year.
   d. Diagnose a patient's illness based on the presence or absence of symptoms including fever, sore throat, swollen glands, headache, and congestion.
   e. Develop a model to predict the likelihood (as a percentage) that a given Furman alum will respond to a solicitation letter with a donation.
   f. Determine which products women in their 20s are most likely to purchase together at Amazon, based on previous purchase data.

4. Consider a successful predictive model such as the one for stock prediction discussed by Siegel. Can it be assumed to be equally successful in the future as it is today? Why or why not? Defend your answer with a good, specific technical explanation. (Do not be vague. Your answer should reveal a general truth about data mining.)
5. Consider the following hypothetical data set. For these questions, your rules don't necessarily have to be real. You're not expected to do data mining in your head. Give a plausible answer that shows you know what the terms mean.
   a. What is a plausible example of a classification rule that might be learned from this data? Be sure to identify the class attribute.
   b. What is a plausible example of an association rule? What differentiates it from the classification rule?

6. A little prediction goes a long way. If your charitable organization sends mail to 1 million prospects at a cost of $1 each, and 1 of every 100 will donate $100, then you just break even:

   ($100 x 10,000 responses) - ($1 x 1 million) = $0

   Assume a predictive model that identifies ¼ of your list as being 3 times more likely to donate $100. Modify this formula to show quantitatively the value of this predictive ability. (In other words, how much more revenue will you earn?)
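The mailing arithmetic in question 6 can be checked quantitatively. This is one reading of the scenario, assuming the organization mails only the quarter of the list the model flags (the variable names are illustrative):

```python
cost_per_letter = 1     # dollars per letter mailed
donation = 100          # dollars per response
list_size = 1_000_000   # total prospects

# Baseline: mail everyone; 1 of every 100 prospects donates.
baseline_responses = list_size // 100                                   # 10,000
baseline_net = donation * baseline_responses - cost_per_letter * list_size  # $0

# With the model: mail only the quarter flagged as 3x more likely to donate.
targeted = list_size // 4                                               # 250,000
targeted_responses = 3 * targeted // 100                                # 7,500 (3 per 100)
targeted_net = donation * targeted_responses - cost_per_letter * targeted

print(baseline_net, targeted_net)  # 0 500000
```

Under this reading, targeting turns a break-even campaign into a $500,000 net gain, because three-quarters of the mailing cost is eliminated while most of the donations are retained.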
7. With great power comes great responsibility. Imagine a data mining system designed to predict which employees of Google will quit in the next three months.
   a. Describe one benefit of such a system.
   b. What is one ethical concern that might arise from the development and use of such a system?

8. Consider the following linear regression formula:

   University GPA = (0.675)(High School GPA) + 1.097

   a. This is the output of what type of data mining algorithm?
   b. What does the model predict as the college GPA for a student who earned a 3.0 in high school?
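The regression formula in question 8 is simple enough to evaluate directly. A minimal sketch (the function name is illustrative):

```python
def predict_university_gpa(high_school_gpa):
    # Linear regression model from question 8:
    # University GPA = (0.675)(High School GPA) + 1.097
    return 0.675 * high_school_gpa + 1.097

# 0.675 * 3.0 = 2.025; 2.025 + 1.097 = 3.122
print(round(predict_university_gpa(3.0), 3))  # 3.122
```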
9. Consider the following ARFF data file:

   @relation census
   @attribute salary_level { high, medium, low }
   @attribute age numeric
   @attribute sex { female, male }
   @data
   39, female, low
   50, male, low
   52, female, medium

   What is the formatting error in this file? How should it be fixed?

HAVE A GREAT WEEKEND!