Principal components analysis II


Patrick Breheny
BST 764: Applied Statistical Modeling
November 29

svd

Singular value decompositions can be computed in R via the svd function:

    S <- svd(X)

where S$u, S$d, and S$v contain the three elements of the decomposition. Thus, principal components can be computed manually, provided that we first center and scale the matrix:

    P <- svd(scale(X))
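As a minimal sketch of the manual computation (assuming X is a numeric matrix of predictors, e.g. from the heart data), the principal component scores are UD and the loadings are the columns of V:

    ## sketch: principal components from the SVD of the centered and scaled matrix
    S <- svd(scale(X))
    scores   <- S$u %*% diag(S$d)   # principal component scores (one column per component)
    loadings <- S$v                 # loadings (one column per component)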

prcomp

However, it is generally more convenient to use the prcomp function instead:

    P <- prcomp(X, scale=TRUE)

where it is worth pointing out that, for certain historical reasons, the default is scale=FALSE, but scaling the matrix is generally advisable. It is also possible to specify a subset of variables using a formula interface:

    prcomp(~ ldl + sbp + obesity, data=heart)

The components of P that are most useful are P$x, which contains the principal components; P$rotation, which contains the loadings (V); and P$sdev, which contains the standard deviations of each of the principal components.
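As a quick sketch of how these pieces fit together (assuming X is the matrix passed to prcomp), the scores in P$x are simply the centered and scaled data projected onto the loadings:

    ## sketch: the scores are the scaled data multiplied by the loadings
    P <- prcomp(X, scale=TRUE)
    all.equal(unname(P$x), unname(scale(X) %*% P$rotation))   # should be TRUE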

summary

There are several advantages of using prcomp instead of manually computing the principal components via svd; namely, that it provides summary, plot, and predict methods:

    > summary(P)
    Importance of components:
                             PC1   PC2   PC3   PC4
    Standard deviation     1.698 1.094 1.037 0.972
    Proportion of Variance 0.321 0.133 0.120 0.105
    Cumulative Proportion  0.321 0.454 0.573 0.678
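For reference, the proportions reported by summary are just the component variances (squared standard deviations) divided by their total; a minimal sketch, assuming P is the prcomp object above:

    ## proportion of variance explained by each component
    prop <- P$sdev^2 / sum(P$sdev^2)
    cumsum(prop)   # cumulative proportion, as reported by summary(P)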

plot

    plot(P)

[Figure: scree plot of P, a bar plot of the variances of the principal components.]

predict

Perhaps the most useful reason to use prcomp over svd, however, is the predict method. Suppose you want to use your principal components to make predictions for new data; applying prcomp to the new data does not work, as the new principal components will not be the same as the old ones. Fortunately, predict(P, newX) calculates the original principal components of P for the new data.
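Under the hood, this amounts to centering and scaling the new observations with the values from the original fit and then applying the original loadings; a minimal sketch (newX is a hypothetical matrix or data frame with the same columns as X):

    ## scores for new data on the ORIGINAL components
    new_scores <- predict(P, newdata = newX)
    ## equivalent manual computation, using the original centering/scaling and loadings
    manual <- scale(as.matrix(newX), center = P$center, scale = P$scale) %*% P$rotation
    all.equal(unname(new_scores), unname(manual))   # should be TRUE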

PROC PRINCOMP

In SAS, principal components analysis is carried out via PROC PRINCOMP:

    PROC PRINCOMP DATA=X OUT=P;
    RUN;

where it is worth pointing out that SAS has the more sensible default of scaling the matrix, although you do have the option to leave X unscaled. The output data set P contains the original variables as well as the principal components, named Prin1, Prin2, ...

Supervised vs. unsupervised learning

We will now take a more in-depth look at the heart disease data from the previous lecture. Before we do so, however, it is worth noting once again that principal components have many uses and are not solely a method for transforming inputs to regression problems. Thus, although this problem has an outcome (CHD) that we are trying to predict, principal components are also widely used in problems where there is no outcome, simply a number of variables whose interrelationships we are interested in.

Principal components regression

Results of a logistic regression fit to the principal components:

             β     SE   z value          p
    PC1  -0.63   0.08     -7.71   < 0.0001
    PC2   0.16   0.10      1.68       0.09
    PC3   0.36   0.11      3.25     < 0.01
    PC4   0.49   0.11      4.32   < 0.0001
    PC5  -0.46   0.12     -3.83    < 0.001
    PC6  -0.22   0.12     -1.79       0.07
    PC7  -0.13   0.13     -0.95       0.34
    PC8  -0.32   0.17     -1.89       0.06
    PC9  -0.00   0.29     -0.00       1.00
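A minimal sketch of how such a fit could be obtained in R (the names heart$chd and X are assumptions: a binary outcome and a matrix of the nine predictors from the heart data; this is an illustration, not the exact code used for the slide):

    ## principal components regression: logistic regression on the PC scores
    P   <- prcomp(X, scale=TRUE)
    fit <- glm(heart$chd ~ P$x, family = binomial)
    summary(fit)   # one coefficient per component, PC1, ..., PC9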

Pros and cons of principal components regression

The obvious advantages of regressing on the principal components are:

- The variables are ordered in terms of standard error; thus, they also tend to be ordered in terms of statistical significance
- The variables are orthogonal, so including, say, PC9 in the model has no bearing on, say, PC3

The primary disadvantage is that this model is far more difficult to interpret than a regular logistic regression model.

One of the main reasons why principal components are difficult to interpret is that every variable shows up in every factor (and vice versa). There are literally dozens of approaches that have been proposed to try to alleviate this problem and produce more interpretable components. Typically, these approaches are grouped under the umbrella of factor analysis, with sparse principal components analysis being a more modern take on the problem.

Exploratory approaches

Factor analysis and sparse PCA are automatic approaches, but the problem can be approached in a more exploratory fashion as well; or indeed, the two approaches can be combined. A combination approach was taken by Harrell et al. in the original paper on the WHO-ARI data set that you have looked at in your homework: the initial principal components were rotated using factor analysis, and then scientific judgment was used to pare the components down into compact, interpretable quantities that represented meaningful biological concepts.
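As an illustration of the rotation step only (not the specific procedure of Harrell et al.), a varimax rotation of the leading loadings can be computed with the base-R varimax function; a minimal sketch, assuming P is the prcomp fit and X the predictor matrix from earlier:

    ## rotate the first four loading vectors toward a simpler, more interpretable structure
    rot <- varimax(P$rotation[, 1:4])
    rot$loadings                          # rotated loadings, typically with many near-zero entries
    rotated_scores <- scale(X) %*% rot$loadings   # scores on the rotated components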

Multiple comparisons concerns

One key thing to keep in mind is that exploration of factors/components can be done without worrying about multiple comparisons. All we are contemplating is various transformations of the design matrix, and we are evaluating them without looking at the outcome, thereby retaining statistical validity. If we were to tinker with the components based on whether they improve the significance of the resulting regression, on the other hand, statistical validity would be lost.

Reconsidering factors

Regardless of the method used, automatic construction of factors is rarely a good idea if the interpretability of the model is important. For example, in the heart data set, age and body fat are highly correlated (0.63), but that doesn't mean it makes sense to think of them as arising from one common underlying factor. Indeed, this is one of the goals of regression: the ability to separate out the effects of correlated predictors.

Constructing factors for the heart data

We could, for instance, fit a model in which age, tobacco, alcohol, family history, and stress are all included as conventional predictors, but use principal components to combine cholesterol, blood pressure, obesity, and adiposity. On the positive side, this approach is more parsimonious and produces, on the whole, lower p-values. On the negative side, it appears to miss the fact that LDL cholesterol is significantly associated with coronary heart disease.
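A minimal sketch of this kind of hybrid model in R; the variable names (ldl, sbp, obesity, adiposity, famhist, and typea as the stress measure) are assumptions about the heart data, and the code is an illustration rather than the exact model fit on the slide:

    ## combine the four correlated measures into one leading component
    body <- prcomp(~ ldl + sbp + obesity + adiposity, data = heart, scale = TRUE)
    heart$body_pc1 <- body$x[, 1]

    ## remaining predictors enter the logistic regression conventionally
    fit <- glm(chd ~ age + tobacco + alcohol + famhist + typea + body_pc1,
               family = binomial, data = heart)
    summary(fit)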

Condition numbers

Besides their use in constructing principal components, the singular values themselves are often useful in detecting multicollinearity. A common diagnostic for multicollinearity is the condition number κ, which is simply the largest singular value divided by the smallest. A general rule of thumb is that if κ > 15, you have a multicollinearity problem, and if κ > 30, you have a big multicollinearity problem. In the heart disease data set, κ = 4, suggesting that although some of the variables are correlated, multicollinearity is not an overwhelming concern.
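A minimal sketch of this diagnostic in R, assuming X is the predictor matrix (centered and scaled before taking the SVD):

    ## condition number: ratio of largest to smallest singular value
    d <- svd(scale(X))$d
    kappa_hat <- max(d) / min(d)
    kappa_hat   # values above roughly 15 (or 30) suggest a (serious) multicollinearity problem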

Final remarks

On the whole, the coronary heart disease data set is not particularly well-suited for a principal components approach. However, the method can dramatically improve estimation and insight in problems where multicollinearity is a large problem, as well as aid in detecting it. It is very difficult to make sweeping generalizations about principal components; their use is very much decided on a case-by-case basis, as the homework assignment hopefully illustrates.