Data Mining, CSCI 347, Fall 2017 Exam 1, Sept. 22

1. Supervised learning is best described by: (4 pts.)
   a. Weka learning which requires user input
   b. Weka learning which focuses on clusters
   c. Learning where class values are used in the learning
   d. Learning where the data is normalized
   e. Learning about associations amongst the data

2. Normalized data can best be described by: (4 pts.)
   a. Common input to data mining algorithms
   b. Data structured to facilitate comparison of attribute values
   c. Values given for a pre-determined set of attributes
   d. Parameters for the learning experience
   e. Data organized into relations so that repetition is minimized

3. Choose the term which best describes a measurement where values are ordered and measured in fixed and equal units, but zero is not defined: (4 pts.)
   a. Interval quantity
   b. Ratio quantity
   c. Normalized quantity
   d. Nominal quantity
   e. Ordinal quantity

4. Data integration can best be described as: (4 pts.)
   a. Combining datasets
   b. Removing data which is lacking attribute values
   c. Identifying and removing outliers from the data
   d. Converting nominal data to numeric
   e. Adding a Laplace estimator to data values

5. The process of anonymizing data can best be described as: (4 pts.)
   a. Discovering the identification of someone from data
   b. Augmenting data with associations
   c. Combining data using join relations
   d. Removing identifying information from the data
   e. Making the data appear in clusters

Short Answer

6. Give the total number of combinations of values for the following three attributes with the values given. (6 pts.)
   Attribute 1: Bagel flavor: Plain, Whole wheat, Cinnamon and raisin, Garlic, and Everything
   Attribute 2: Toasted: Yes or No
   Attribute 3: Condiment: Butter, Cream cheese, and Hummus

5 flavors, 2 ways of toasting (yes and no), and 3 condiments give 5*2*3 = 30 combinations.

7. What is meant by the resubstitution error? (6 pts.)

The error resulting from testing on the training set.

8. Describe what is meant by overfitting. (6 pts.)

Overfitting occurs when training has been done on a dataset and what is learned is so specific to it that the model performs very well on the training data but does not perform well on real-world data.
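The count in question 6 can be checked by enumerating every combination. This is only a small sketch; the variable names are illustrative and not part of the exam.

```python
from itertools import product

# Attribute values as listed in question 6.
flavors = ["Plain", "Whole wheat", "Cinnamon and raisin", "Garlic", "Everything"]
toasted = ["Yes", "No"]
condiments = ["Butter", "Cream cheese", "Hummus"]

# Every (flavor, toasted, condiment) triple is one combination.
combinations = list(product(flavors, toasted, condiments))
n_combinations = len(combinations)

print(n_combinations)  # 5 * 2 * 3 = 30
```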

9. Given a survey where customers indicated, on a scale of 1-10, how likely they are to purchase a product. (1 means they never expect to purchase the product; 10 means they expect to purchase the product.) Say that the average rating for product A is 3 and the average rating for product B is 6. Does it make sense to say that customers are twice as likely to purchase product B as product A? Why or why not? (6 pts.)

It does not make sense because the scale of 1-10 is not an interval scale; it is only ordinal. A value of 3 is definitely less than a value of 6. However, a customer who answers 6 to the survey is not necessarily twice as likely to purchase the product as a customer who answers 3.

10. Decision trees and rules are similar in some ways and different in others. Discuss the similarities and differences. (6 pts.)

Trees have a strict format, while rules are much more flexible. Given a tree, it is easy to list an equivalent set of rules. Given a set of rules, it may or may not be possible to draw a tree which captures the same information. Trees are always traversed the same way. A set of rules, on the other hand, may have to be executed in a specific order, or may be executable in any order.

11. Knowledge can be in the form of structural patterns such as tables, trees, rules and numeric functions. Describe each type of structural pattern. Follow your description with an example of the pattern using the numerical weather dataset.

Numerical Weather Dataset

Trees (5 pts.)
Description: Divide and conquer approach. In a tree the attributes are nodes, the values are branches coming out of the nodes, and the leaves of the tree are the classifications.
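As a rough illustration of the tree description, a tree can be written as nested tests: each `if` is a node testing one attribute, each branch is a value (or numeric range), and each `return` is a leaf. The particular tree and the humidity threshold of 75 are assumptions chosen for illustration; they are not taken from the exam's answer.

```python
def classify(outlook, humidity, windy):
    """Walk an illustrative decision tree for the weather data.

    The splits used here (outlook first, humidity <= 75 under sunny,
    windy under rainy) are assumptions for this sketch.
    """
    if outlook == "sunny":          # node: outlook, branch: sunny
        return "yes" if humidity <= 75 else "no"   # node: humidity
    if outlook == "overcast":       # branch: overcast leads straight to a leaf
        return "yes"
    if outlook == "rainy":          # branch: rainy
        return "no" if windy else "yes"            # node: windy
    raise ValueError(f"unknown outlook: {outlook!r}")


print(classify("sunny", 85, False))    # follows sunny, then humidity > 75
print(classify("overcast", 90, True))  # overcast leaf
```

Reading each path from the root to a leaf gives one rule, which is why a tree can always be rewritten as an equivalent rule set.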

Rules (5 pts.)
Description: Rules are nuggets of information that can be used for classifying or just for seeing relations amongst the data. Their format is: IF ... THEN ... (can have an ELSE). Rules are more flexible than trees but can be harder to visualize.

IF outlook = sunny THEN play = no

Equations or Functions (4 pts.)
Description: Mathematical formulas (these work most naturally with numeric attributes) where a numeric class value is given as the sum of weighted attributes and a constant, which is called the bias. Since the class value is not numeric, structural patterns in the form of an equation or function are not appropriate for this dataset. An example of a structural pattern in the form of equations or functions for the CPU performance dataset is:

Predicting cpu performance
cpu = 0.64 * maxm * chman

This question was not counted, as the dataset was not good for exemplifying this structural pattern.
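The "sum of weighted attributes and a constant" can be sketched as a small function. The attribute names `x1` and `x2`, the weights, and the bias below are hypothetical values chosen only to show the arithmetic; they are not the CPU performance model.

```python
def linear_prediction(attributes, weights, bias):
    """Return bias + sum of weight * value over the attributes."""
    return bias + sum(weights[name] * value for name, value in attributes.items())


# Hypothetical weights and bias, for illustration only.
weights = {"x1": 0.5, "x2": 2.0}
bias = 1.0

example = {"x1": 4.0, "x2": 3.0}
prediction = linear_prediction(example, weights, bias)
print(prediction)  # 1.0 + 0.5*4.0 + 2.0*3.0 = 9.0
```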

Table (Decision Table) (4 pts.)
Description: A decision table lists all possible combinations of attribute values and gives a result for each.

Outlook   Temperature  Humidity  Windy  Play
sunny                            true   yes
sunny                            false  yes
sunny     80           <70       true   yes
sunny     80           <70       false  no
overcast                         true   yes
overcast                         false  no
overcast  80           <70       true   yes
overcast  80           <70       false  yes
rainy                            true   yes
rainy                            false  no
rainy     80           <70       true   yes
rainy     80           <70       false  yes

12. Describe the process of 3-fold evaluation of the Numeric Weather Dataset. Be specific in your description. List the instances in each fold, using the No. attribute (which is not actually part of the dataset), describe the entire process and tell what will result. Remember that the folds will be stratified. (10 pts.)

There are 14 instances, 9 yes and 5 no. 3 folds will break the dataset into 3 stratified sets:
   Fold 1: 5 instances, 3 yes and 2 no: {1, 2, 3, 4, 5}
   Fold 2: 5 instances, 3 yes and 2 no: {6, 7, 8, 9, 10}
   Fold 3: 4 instances, 3 yes and 1 no: {11, 12, 13, 14}

First, training will occur on the first 2 folds {1-10} and the last fold will be used for testing. Second, training will occur on the first and 3rd folds {1-5, 11-14} and the middle fold will be used for testing. Finally, training will occur on the last two folds {6-14} and the first fold will be used for testing. The results will be averaged to give the final evaluation score.

13. Say that 0R training is used. Tell what the final result will be, showing your work. (10 pts.)

In the first training, 0R will predict yes. Testing will give 3 correct and 1 wrong, or 75% correct. The second training will also predict yes. Testing will give 3 correct out of 5, or 60%. The final training will be like the second: 0R predicts yes and testing gives 3 correct out of 5, or 60%. The results will be averaged: (75 + 60 + 60)/3, or 65% correct.
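The 3-fold procedure with 0R can be replayed in a few lines. This is a sketch under one assumption: the play labels per instance number are taken from the standard Weka weather dataset, which agrees with the yes/no counts given for each fold above. The names `play`, `folds`, and `zero_r` are illustrative.

```python
from collections import Counter

# Class labels by instance No. (assumed: standard Weka weather data).
play = {1: "no", 2: "no", 3: "yes", 4: "yes", 5: "yes",
        6: "no", 7: "yes", 8: "no", 9: "yes", 10: "yes",
        11: "yes", 12: "yes", 13: "yes", 14: "no"}

# The three stratified folds listed in the answer to question 12.
folds = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14]]


def zero_r(train_ids):
    """0R: always predict the majority class of the training instances."""
    return Counter(play[i] for i in train_ids).most_common(1)[0][0]


accuracies = []
for k, test_ids in enumerate(folds):
    # Train on every fold except the one held out for testing.
    train_ids = [i for j, fold in enumerate(folds) if j != k for i in fold]
    prediction = zero_r(train_ids)
    correct = sum(play[i] == prediction for i in test_ids)
    accuracies.append(correct / len(test_ids))

mean_accuracy = sum(accuracies) / len(accuracies)
print([round(a, 2) for a in accuracies], round(mean_accuracy, 2))
```

The loop holds out the folds in the order 1, 2, 3, so the per-fold accuracies come out as 60%, 60%, and 75%; the average is the same 65% regardless of order.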