CSC-272 Exam #1 February 13, 2015

Name: ____________________

Questions are weighted as indicated. Show your work and state your assumptions for partial credit consideration. Unless explicitly stated, there are NO intended errors and NO trick questions. If in doubt, ask! You have 50 minutes to work.

Now, take a moment to relax. If you don't immediately see how to do something, THINK! Don't panic!

Multiple Choice (2 points each):

1. Data used to build a data mining model is called
   a. validation data
   b. training data
   c. test data
   d. hidden data

2. Supervised learning differs from unsupervised learning in that supervised learning requires
   a. at least one input attribute
   b. input attributes to be nominal
   c. at least one output attribute
   d. output attributes to be nominal

3. A nearest neighbor approach is best used
   a. with large-sized datasets
   b. when irrelevant attributes have been removed from the data
   c. when a generalized model of the data is desirable
   d. when an explanation of what has been found is of primary importance

4. Classification problems are distinguished from estimation problems in that
   a. classification problems require the output attribute to be numeric
   b. classification problems require the output attribute to be nominal
   c. classification problems do not allow an output attribute
   d. classification problems are designed to predict future outcome

5. Which statement is true about prediction problems?
   a. The output attribute must be nominal
   b. The output attribute must be numeric
   c. The resultant model is designed to determine future outcomes
   d. The resultant model is designed to classify current behavior

6. Unlike traditional classification rules, association rules
   a. allow the same attribute to be an input attribute in one rule and an output attribute in another rule
   b. allow more than one input attribute in a single rule
   c. require input attributes to take on numeric values
   d. require each rule to have exactly one categorical output attribute

7. Association rule support is defined as
   a. the percentage of instances that contain the antecedent conditional items listed in the association rule
   b. the percentage of instances that contain the consequent conditions listed in the association rule
   c. the percentage of instances that contain all items listed in the association rule
   d. the percentage of instances that contain at least one of the antecedent conditional items listed in the association rule

8. Eric Siegel used predictive analytics to choose which ad to display based on:
   a. the revenue the ad generated on average
   b. the click-throughs the ad generated on average
   c. the percent likelihood that a specific user would click on the ad
   d. the percent likelihood that anyone would click on the ad

Fill in the Blank (2 points each blank): Use the following list of terms to fill in the blanks with the best possible term. More than one answer might be justifiable. Resulting sentences are not necessarily grammatically correct.

big data mixed attributes ARFF model dataset sparse data database query decision tree outliers class attribute linear regression formula visualization classification overfitting instance-based machine learning data decision boundary rules information nodes market basket analysis instance leaves supervised learning association regression tree unsupervised learning numeric estimation antecedent clustering test data consequent classification rule flat file decision list association rule denormalization support nominal data warehouse confidence ordinal overlay nearest neighbor

________ is a row in a dataset.

________ is employed to help evaluate the quality of a model.

2D and 3D graphs are a ________ tool that can be used to detect ________ in a dataset.

________ results when a majority of attributes in a large dataset have no value.

________ describes the values hot > mild > cold.

________ is supplemental data sometimes required to build an effective dataset.

________ results from giving too much emphasis to non-predictive attributes of the dataset.

________ of a rule is determined by dividing the rule's support by the number of instances the rule's ________ applies to.

Short Answer (6 points each): Give (relatively) short answers to the following questions. You must omit any one question by writing OMIT clearly in the space provided.

1. Consider the following decision tree:

    Business Appointment?
      No:  Temp above 70?
             No:  Decision = wear jeans
             Yes: Decision = wear shorts
      Yes: Decision = wear slacks

   a. This is the output of what type of data mining algorithm?
   b. How would the new instance Raining = Yes; Business Appointment = No; Temp above 70? = Yes be classified by the tree?

2. a. Suppose a dataset with 100 instances was used to create the tree from problem 1 using the J48 algorithm. Why wouldn't it be a good idea to use these 100 instances as the test data to evaluate the model?
   b. Briefly describe either the ten-fold cross-validation strategy or the percentage split strategy for evaluating such a decision tree.

3. Consider each of the following data mining scenarios. For each one, state whether the most appropriate technique to use is classification learning, numeric estimation, or association learning.
   a. Determine a freshman's likely first-year grade point average from the student's combined SAT score, high school class standing, and total number of high school science and math credits.
   b. Develop a model to determine if an individual is a good candidate for a home mortgage loan.
   c. Determine what factors among age, gender, income, type of vehicle, and state of residence are most indicative of individuals who have received three or more traffic violations in the past year.
   d. Diagnose a patient's illness based on the presence or absence of symptoms including fever, sore throat, swollen glands, headache, and congestion.
   e. Develop a model to predict the likelihood (as a percentage) that a given Furman alum will respond to a solicitation letter with a donation.
   f. Determine which products women in their 20s are most likely to purchase together at Amazon, based on previous purchase data.

4. Consider a successful predictive model such as the one for stock prediction discussed by Siegel. Can it be assumed to be equally successful in the future as it is today? Why or why not? Defend your answer with a good, specific technical explanation. (Do not be vague. Your answer should reveal a general truth about data mining.)

5. Consider the following hypothetical data set:

   For these questions, your rules don't necessarily have to be real. You're not expected to do data mining in your head. Give a plausible answer that shows you know what the terms mean.
   a. What is a plausible example of a classification rule that might be learned from this data? Be sure to identify the class attribute.
   b. What is a plausible example of an association rule? What differentiates it from the classification rule?

6. A little prediction goes a long way. If your charitable organization sends mail to 1 million prospects at a cost of $1 each, and 1 of every 100 will donate $100, then you just break even:

    ($100 x 10,000 responses) - ($1 x 1 million)

   Assume a predictive model that identifies ¼ of your list as being 3 times more likely to donate $100. Modify this formula to show quantitatively the value of this predictive ability. (In other words, how much more revenue will you earn?)
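The break-even arithmetic stated in question 6 can be checked with a short sketch. The function name `campaign_profit` and its parameter names are illustrative assumptions, not part of the exam, and the sketch covers only the baseline mailing as stated, not the modified formula the question asks for.

```python
def campaign_profit(prospects: int, cost_per_piece: int, one_in: int, donation: int) -> int:
    """Profit of a mailing: donation revenue minus mailing cost.

    All names here are hypothetical. `one_in` means "1 of every `one_in`
    prospects donates", matching the wording of the question.
    """
    responses = prospects // one_in        # number of donors
    revenue = donation * responses         # total donations received
    cost = cost_per_piece * prospects      # total mailing cost
    return revenue - cost


# Baseline from the question: 1 million prospects, $1 each,
# 1 of every 100 donates $100.
baseline = campaign_profit(1_000_000, 1, 100, 100)
print(baseline)
```

Integer arithmetic is used so that the result is exact and no rounding is involved.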

7. With great power comes great responsibility. Imagine a data mining system designed to predict which employees of Google will quit in the next three months.
   a. Describe one benefit of such a system.
   b. What is one ethical concern that might arise from the development and use of such a system?

8. Consider the following linear regression formula:

    University GPA = (0.675)(High School GPA) + 1.097

   a. This is the output of what type of data mining algorithm?
   b. What does the model predict as the college GPA for a student who earned a 3.0 in high school?
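As a minimal sketch of how a formula of this form is applied, the coefficients from question 8 can be wrapped in a function. The name `predict_university_gpa` and the sample input of 4.0 are assumptions chosen for illustration; they do not come from the exam.

```python
SLOPE = 0.675      # coefficient on High School GPA, from the formula above
INTERCEPT = 1.097  # constant term, from the formula above


def predict_university_gpa(high_school_gpa: float) -> float:
    """Apply the formula: University GPA = 0.675 * High School GPA + 1.097.

    The function name is hypothetical; only the two coefficients
    are taken from the question.
    """
    return SLOPE * high_school_gpa + INTERCEPT


# Illustrative input (an assumption, not the value asked about in 8b).
print(round(predict_university_gpa(4.0), 3))
```

The same function works for any high school GPA; the model is just a slope times the input plus an intercept.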

9. Consider the following ARFF data file:

    @relation census
    @attribute salary_level { high, medium, low }
    @attribute age numeric
    @attribute sex { female, male }
    @data
    39, female, low
    50, male, low
    52, female, medium

   What is the formatting error in this file? How should it be fixed?

HAVE A GREAT WEEKEND!