
Composite Performance Measure Evaluation Guidance April 8, 2013

Contents

Introduction
  Purpose
Background
  Prior Guidance on Evaluating Composite Measures
  NQF Experience with Composite Performance Measures
Definition of Composite Performance Measure
Types of Composite Performance Measures
Key Steps in Developing a Composite Performance Measure
Guiding Principles
  Terminology
  Component Measures
  Composite Performance Measure
Recommendations for Composite Performance Measure Evaluation
  Importance to Measure and Report
  Scientific Acceptability of Measure Properties
  Feasibility
  Usability and Use
  Comparison to Related and Competing Measures
Recommendations for Review Process
Next Steps
Appendix A: Glossary
Appendix B: Approaches for Constructing Composite Performance Measures
Appendix C: References Consulted
Appendix D: Technical Expert Panel and NQF Staff

Introduction

Healthcare is a complex and multidimensional activity. While individual performance measures provide much important information, there is also value in summarizing performance on multiple dimensions. Composite performance measures, which combine information on multiple individual performance measures into one single measure, are of increasing interest in healthcare performance measurement and public accountability applications. According to the Institute of Medicine,1 such measures can enhance the performance measurement enterprise and provide a potentially deeper view of the reliability of the care system. Further, composite performance measures may be useful for informed decisionmaking by multiple stakeholders, including consumers, purchasers, and policymakers.

Composite performance measures are complex and require a strong conceptual and methodological foundation. As with individual performance measures, the methods used to construct composites affect the reliability, validity, and usefulness of the composite measure and require some unique considerations for testing and analysis.

NQF endorses performance measures that are intended for use in both performance improvement and accountability applications. Several composite measures are included in NQF's portfolio of endorsed measures, and NQF previously developed guidance2 to assist NQF steering committees with their assessment of these measures as part of the NQF evaluation process. Since that time, however, NQF has updated the standard measure evaluation criteria and guidance for evidence, measure testing, and usability; thus, there is a need to align the evaluation criteria for composite measures with the updated guidance.

Purpose

The purpose of the Composite Performance Measure Evaluation Guidance Project was to review and update NQF's criteria and guidance on evaluating composite performance measures for potential NQF endorsement. Specifically, the goals of the project were to:

- review the existing guidance for evaluating composite performance measures;
- identify any unique considerations for evaluating composite performance measures within the context of NQF's updated endorsement criteria; and
- modify existing criteria and guidance and/or provide additional recommendations for evaluating composite performance measures.

To achieve these goals, NQF convened a 12-member Technical Expert Panel (TEP), composed primarily of methodologists and other experts in the development of composite performance measures. In addition to reviewing papers describing a wide variety of evidence and experience around composite measures, and participating in several conference calls, the TEP also gathered for a one-day in-person meeting in Washington, DC, on November 2, 2012.

Background

Prior Guidance on Evaluating Composite Measures

In 2008-2009, NQF initiated a project to identify a framework for evaluating composite performance measures. That developmental work included defining composite performance measures, articulating principles underlying the evaluation of composite performance measures, and developing an initial set of specific criteria (to be used in addition to NQF's standard evaluation criteria) with which to evaluate composite performance measures for potential NQF endorsement. The principles articulated for evaluating composite performance measures reflected the need for a concept of quality underlying the composite measure and justification of the methods used to construct and test the measure for reliability and validity. The criteria emphasized the need for transparency around the methodology used for composite measure construction and required that both the components of the composite and the composite measure as a whole meet NQF's measure evaluation criteria. This work served as the basis for the current project.

NQF Experience with Composite Performance Measures

Since 2007, 31 measures submitted to NQF for potential endorsement have been flagged as composite measures. Of these, 25 are currently endorsed.[a] Many of the endorsed composite measures (n=11) are derived from surveys targeted toward patients or consumers (e.g., the Consumer Assessment of Healthcare Providers and Systems (CAHPS) surveys). The remainder of the endorsed composite performance measures consist of all-or-none measures (n=5) and composites constructed using various aggregation and weighting methodologies (n=9). As with NQF-endorsed individual measures, these composite measures are considered suitable for both performance improvement and accountability applications.

However, the evaluation of composite measures for potential endorsement has not been without difficulty. The most common issues have revolved around the identification of composite measures, ambiguity in the guidance when a component measure is not NQF-endorsed, and incomplete submissions.

[a] See NQF's Quality Positioning System search tool; for a list of currently endorsed composite measures, filter based on measure type.

Identifying Measures as Composites

Not all composite measures that have met, or potentially have met, the current NQF definition of composite measures have been flagged by the measure developers as composite measures. These include all-or-none composites in which the components are assessed at the patient level (i.e., whether each patient received all of the required processes); simpler all-or-none measures requiring that linked multiple conditions are met (e.g., assess vaccination status and administer flu vaccine); and any-or-none measures that assess whether a patient has exhibited any or all of a list of complications. For such measures it is unclear whether the additional analyses indicated for composite measures (e.g., analysis of components to demonstrate alignment with the conceptual construct and contribution to the variation in the overall composite score) are applicable, and often these additional analyses have not been submitted by developers of these measures.

Evaluation of Component Measures

The current guidance indicates that the component measures that make up a composite measure should be NQF-endorsed, or evaluated as meeting the individual measure evaluation criteria, as the first step in evaluating the composite measure. However, the guidance goes on to state that while a component measure might not be important enough in its own right as an individual measure, it could be determined to be an important component of a composite. Some developers have interpreted this guidance to mean that components do not need to meet the Importance to Measure and Report criteria around evidence, impact, and performance gap. But this interpretation regarding evidence and performance gap calls into question the basis for including the component measure.

Another issue related to the evidence subcriterion is whether measures of processes that are distal to desired outcomes could be included in composite measures. For example, a performance measure of merely obtaining a lab test is often considered not to meet the evidence subcriterion because it is so distal to the desired outcome and is usually based on expert opinion rather than direct evidence; however, this type of component has been suggested for inclusion in a composite measure. It also is not clear whether balancing measures that would not meet the importance criterion should be included in a composite performance measure. A balancing measure is not the main focus of interest but is used to identify or monitor potential adverse consequences of measurement. For example, for a performance measure such as 90-minute door-to-balloon time for cardiac catheterization, a balancing measure might be the rate of premature or unwarranted activation of the catheterization lab team, which is an undesirable and costly unintended negative consequence.

Finally, it has been difficult to apply criteria for related and competing measures to composite measures. The challenges with measure harmonization are amplified with composite measures because, typically, more measures (from multiple developers) are involved in harmonization discussions. While using previously endorsed measures as components in a composite measure should ameliorate most difficulties around harmonization, often the components in submitted composite measures have not been previously endorsed, as noted above. In such cases, these components either compete directly with other endorsed measures or are not harmonized with endorsed measures.

Incomplete Submissions Related to Requirements for Composite Measures

As discussed earlier, if measures are not flagged as composite measures, then the additional information needed to evaluate them as composite measures may not be submitted by measure developers. However, non-responsiveness to composite-specific items also has been a problem. For example, the current criteria state that the purpose/objective of the composite measure and the construct for quality must be clearly described, yet often little beyond a list of the component measures is provided. Current criteria require testing for reliability and validity of the composite measure (even if the individual measures have demonstrated reliability and validity), as well as additional analyses to justify the inclusion of component measures and the specified aggregation and weighting rules. Reliability and validity testing of the composite measure may not have been conducted. Some of the composite questions refer to correlational analyses, which may not be appropriate for all composite measures. While the current guidance recognizes this and indicates that the developer could submit other analyses with rationale, these alternative analyses have not always been submitted (or, if submitted, the rationale may not have been included or may not have been sufficiently explanatory). Analysis of the contribution of individual components to the composite score often has not been submitted. Without this information, NQF steering committees may be left with little more than face validity as a basis for recommending a composite performance measure.

Definition of Composite Performance Measure

The TEP reviewed and retained the definition provided in the initial composite report and added explicit clarification that it refers to composite performance measures. A composite performance measure is a combination of two or more component measures, each of which individually reflects quality of care, into a single performance measure with a single score. While the term composite measure also has been used to refer to multi-item instruments and scales used to obtain data from individuals about a particular domain of health status, quality of life, or experience with care (e.g., CAHPS; PHQ-9), such instruments or scales alone do not constitute a performance measure, either individual or composite. However, if considered a reflection of performance, aggregated data from multi-item instruments or scales can be used as the basis of an individual performance measure, which can, in turn, be used as a component of a composite performance measure. Note that use of patient-reported outcomes (PROs) in performance measurement is the subject of a recent NQF project (see PRO report).

Types of Composite Performance Measures

The TEP acknowledged that a simple definition is not sufficient for determining whether a performance measure should be considered a composite for purposes of NQF measure submission, evaluation, and endorsement. The TEP considered various ways to provide further guidance. Composites often are classified according to the empirical and conceptual relationships among the component measures and between the components and the composite, the approaches for combining the individual components (e.g., all-or-none, opportunity, weighted average), or the type of individual measures included in the composite (e.g., process, outcome). The glossary in Appendix A contains definitions for various approaches to combining the component measures. Appendix B provides a description of various models of the relationships among component measures and with the overall composite. The TEP decided that a formal classification of types of composite performance measures or models would not be particularly useful and could lead to unnecessary attention to naming the approach used to construct the composite. Regardless of the approach to constructing the composite, a coherent quality construct and rationale should guide composite development, testing, analysis, and evaluation.

The TEP agreed that the primary concern for NQF endorsement is whether the resulting composite performance measure is based on sound measurement science, produces a reliable signal, and is a valid reflection of quality. However, to ensure that developers know expectations in advance and submit the additional information and analyses needed to support evaluation of a composite performance measure based on the criteria and guidance provided in this report, it is necessary to clearly delineate what types of measures will be considered composite performance measures for the purposes of NQF measure submission, evaluation, and endorsement (Box 1).

Box 1. Identification of Composite Performance Measures for Purposes of NQF Measure Submission, Evaluation, and Endorsement*

The following will be considered composite performance measures for purposes of NQF endorsement:

- Measures with two or more individual performance measure scores combined into one score for an accountable entity.
- Measures with two or more individual component measures assessed separately for each patient and then aggregated into one score for an accountable entity. These include:
  o all-or-none measures (e.g., all essential care processes received, or outcomes experienced, by each patient); or
  o any-or-none measures (e.g., any or none of a list of adverse outcomes experienced, or inappropriate or unnecessary care processes received, by each patient).

The following will not be considered composite performance measures for purposes of NQF endorsement at this time:

- Single performance measures, even if the data are patient scores from a composite instrument or scale (e.g., a single performance measure on communication with doctors, computed as the percentage of patients whose average score for four survey questions about communication with doctors is equal to or greater than 3).
- Measures with multiple measure components that are assessed for each patient but that result in multiple scores for an accountable entity, rather than a single score. These generally should be submitted as separate measures and indicated as paired/grouped measures.
- Measures of multiple linked steps in one care process assessed for each patient. These measures focus on one care process (e.g., influenza immunization) but may include multiple steps (e.g., assess immunization status, counsel patient, and administer vaccination). These are distinguished from all-or-none composites that capture multiple care processes or outcomes (e.g., foot care, eye care, glucose control).
- Performance measures of one concept (e.g., mortality) specified with a statistical method or adjustment (e.g., empirical Bayes shrinkage estimation) that combines information from the accountable entity with information on average performance of all entities or a specified group of entities (e.g., by case volume), typically in order to increase reliability.

* The list in Box 1 includes the types of measure construction most commonly referred to as composites, but this list is not exhaustive. NQF staff will review any potential composites that do not clearly fit one of these descriptions and make the determination of whether the measure will be evaluated against the additional criteria for composite performance measures.

Discussion

The key feature of a composite is combining information from multiple measures into a single score; therefore, single measures or any measure that results in multiple scores are by definition not composites for purposes of NQF endorsement. The TEP was not in complete agreement regarding the identification of composites for purposes of NQF measure submission, evaluation, and endorsement. Therefore, the decisions regarding identification of composite performance measures listed in Box 1 should be reviewed again after gaining more experience with composites.

Measures with two or more individual performance measure scores combined into one score for an accountable entity. The TEP agreed that these are composite performance measures.

Measures with two or more individual component measures assessed separately for each patient and then aggregated into one score for an accountable entity. All-or-none measures assess whether all essential care processes were received, or all outcomes were experienced, by each patient. Any-or-none measures assess whether any of a list of adverse outcomes were experienced, or inappropriate or unnecessary care processes were received,[b] by each patient. Although not unanimous, a majority of the TEP agreed that all-or-none and any-or-none measures should be considered composite performance measures. These measures are similar in construction in that all the components are assessed separately for each patient.

The TEP discussed whether any-or-none measures that include a group of patient-specific outcomes, such as complications, should always be considered composites. For surgical patients, for example, a developer may create a measure that looks for various events that may occur as unintended consequences of the operation for each patient. These measures may include events that have not previously been considered as individual measures (e.g., hemorrhage) or events that have previously been considered as individual measures (e.g., death, readmission). In some instances, the developer may not view a measure that incorporates multiple events such as complications as an any-or-none composite (e.g., complications are viewed as a single measure instead of multiple measures). The CSAC agreed that such measures will be considered composites, with the expectation that the information needed to evaluate the composite-specific criteria is provided. However, if the developer provides a conceptual justification as to why such a measure should not be considered a composite, and that justification is accepted by the NQF steering committee, the measure can then be considered a single measure rather than a composite.

[b] Any-or-none measures could conceivably include a group of treatments or services that patients should not receive, to address overuse or other types of inappropriate care; however, currently there are no examples of this in the NQF portfolio.
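To make the distinction concrete, the sketch below illustrates in Python how the two patterns just discussed differ mechanically: patient-level all-or-none and any-or-none scoring versus combining entity-level component scores with weights. The component names, weights, and data are hypothetical and are not drawn from any NQF-endorsed measure.

```python
# Illustrative sketch only; hypothetical components, weights, and data.
from statistics import mean

# Patient-level component results for one accountable entity
# (1 = care process received / outcome met, 0 = not).
patients = [
    {"foot_exam": 1, "eye_exam": 1, "glucose_control": 1},
    {"foot_exam": 1, "eye_exam": 0, "glucose_control": 1},
    {"foot_exam": 0, "eye_exam": 1, "glucose_control": 1},
]

# All-or-none: a patient counts as a success only if every component was met.
all_or_none_score = mean(1 if all(p.values()) else 0 for p in patients)

# Any-or-none: a patient counts as a failure if any listed adverse event occurred.
adverse_events = [
    {"hemorrhage": 0, "readmission": 0},
    {"hemorrhage": 1, "readmission": 0},
    {"hemorrhage": 0, "readmission": 0},
]
any_or_none_score = mean(1 if any(p.values()) else 0 for p in adverse_events)

# The other Box 1 pattern: two or more entity-level component scores combined
# into one score, here with hypothetical weights that sum to 1.
component_scores = {"foot_exam": 0.67, "eye_exam": 0.67, "glucose_control": 1.00}
weights = {"foot_exam": 0.25, "eye_exam": 0.25, "glucose_control": 0.50}
weighted_composite = sum(weights[c] * s for c, s in component_scores.items())

print(all_or_none_score, any_or_none_score, round(weighted_composite, 3))
```

In the all-or-none and any-or-none forms the aggregation happens within each patient first; in the weighted form each component is first scored across the entity's patients and only then combined.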
Some performance measures use statistical methods such as empirical Bayes shrinkage to obtain more reliable estimates (e.g., mortality, readmission). These b Any-or-none measures could conceivably include a group of treatments or services that patients should not receive to address overuse or other types of inappropriate care; however, currently there are no examples of this in the NQF portfolio. 6

Key Steps in Developing a Composite Performance Measure

A variety of methods can be used to construct composite performance measures; however, they all involve the following key steps:3-9

- Describing the quality construct to be measured and the rationale for the composite (including its putative advantages compared with relevant individual measures);
- Selecting the component measures to be combined in the composite measure;
- Ensuring that the methods used to aggregate and weight the components support the goal that is articulated for the measure;
- Combining the component measure scores, using the specified method; and
- Testing the composite measure to determine if it is a reliable and valid indicator of quality healthcare.

Guiding Principles

The following key principles were identified by the TEP and guided their recommendations and guidance regarding the evaluation criteria.

Terminology

As noted above, the TEP opted for a broad, generic definition of composite performance measure. The term composite measure is used in reference to individual-level instruments or scales as well as aggregate-level performance measures. NQF only endorses performance measures, not the individual-level instrument or scale. Approaches to composite measure development and construction are described using a variety of terms and can vary by discipline. Nonetheless, the construction and evaluation of composite performance measures should be based on sound measurement science and not necessarily constrained to adhere to a specific method or categorization (e.g., psychometric and clinimetric).

Component Measures

The prior composite evaluation criteria required that each component measure be NQF-endorsed or meet all criteria for NQF endorsement. At times, that has been difficult to implement, particularly for reliability. The TEP noted that individual measures may not be reliable independently because of rare events or small case volume, but could be used successfully within a composite because combining multiple measures can increase reliability of the composite performance measure as a whole. Rather than requiring that each component meet all NQF criteria, the TEP focused on the overall composite and identified those NQF criteria that must be met to justify inclusion of the individual component measures. The TEP agreed, however, that if an individual component measure is NQF-endorsed, then the relevant information from that endorsement process could be referenced during review of the composite performance measure.

- The individual component measures that are included in a composite performance measure should be justified based on the clinical evidence (i.e., for process measures, what is being measured is based on clinical evidence of a link to desired outcomes; for health outcomes, a rationale that the outcome is influenced by healthcare). In some cases, the evidence may be for a group of interventions included in a composite performance measure, rather than each one separately.
- NQF endorsement of the individual component measures should not be mandatory; however, NQF endorsement of component performance measures could satisfy some requirements for those component measures that are included in a composite (e.g., a developer would not have to demonstrate the reliability/validity of a component measure that is currently endorsed).
- The individual components in a composite performance measure generally should demonstrate a gap in performance; however, there may be conceptual (e.g., clinical evidence) or analytical justification (e.g., addition increases the variability/gap for the overall composite measure) for including components that do not have a gap in performance.
- The individual components may not be sufficiently reliable independently, but could contribute to the reliability of the composite performance measure.

Composite Performance Measure

The TEP emphasized the need for a coherent quality construct and rationale to guide construction of the composite as well as to guide evaluation for NQF endorsement. As several authors have noted, a construct can be viewed as the cause of individual quality indicators; that is, the quality indicators reflect or manifest the extent to which the organization has achieved quality. Alternatively, the construct can be viewed as formed from the indicators. The first type of relationship is called reflective and the second formative. A reflective construct assumes that causality flows from the construct to the indicators, while in a formative model, causality flows from the indicators to the construct.10 A construct has been defined as a conceptual term used to describe a phenomenon of theoretical interest. The phenomena that constructs describe can be unobservable (e.g., attitudes) or observable (e.g., task performance); in either case, the construct itself is an abstract term.11 In this report, the term quality construct is applied both to reflective composites and to formative composites.

- Component measures should be selected based on fit with the quality construct, and analyses should support that fit.
- All composite performance measures share the potential for simplification by presenting one summary score instead of multiple scores for individual performance measures. However, simplification alone is not sufficient justification for a composite performance measure.
- Each component should fit the quality construct and provide added value to the composite. The composite performance measure should similarly provide added value relative to having individual performance measures.
- Composite measures are complex, with aggregation and weighting rules that do not apply to individual component measures; therefore, reliability and validity of the composite performance measure score should be demonstrated.
- A coherent quality construct and rationale for the composite performance measure are essential for determining:
  o what components are included in a composite performance measure;
  o how the components are aggregated and weighted;
  o what analyses should be used to support the components and demonstrate reliability and validity; and
  o added value over that of individual measures alone.
- Reliability and validity of the individual components do not guarantee reliability and validity of the constructed composite performance measure. Reliability and validity of the constructed composite performance measure should be demonstrated.
- When evaluating composite performance measures, both the quality construct itself and the empirical evidence for the composite (i.e., supporting the method of construction and methods of analysis) should be considered.
- Each component of a composite performance measure should provide added value to the composite as a whole, either empirically (e.g., it contributes to the reliability or overall score) or conceptually. A related objective is parsimony. However, having a complete set of component measures from all relevant performance domains may be conceptually preferable to dropping measures that do not empirically contribute to comparative evaluation of healthcare entities.
- The individual components in a composite performance measure may or may not be correlated, depending on the quality construct.
- Aggregation and weighting rules for constructing composite performance measures should be consistent with the quality construct and rationale for the composite. A related objective is methodological simplicity. However, complex aggregation and weighting rules may improve the reliability and validity of a composite performance measure, relative to simpler aggregation and weighting rules.
- The standard NQF criteria apply to composite performance measures. NQF only endorses performance measures that are intended for use in both performance improvement and accountability applications.

Recommendations for Composite Performance Measure Evaluation

The NQF performance measure evaluation criteria apply to composite performance measures and their component measures. Evaluation of composite performance measures can, to a large extent, be incorporated into the standard NQF criteria and processes. NQF endorsement is not necessary for the component measures unless they are intended to be used independently to make judgments about performance. However, the individual component measures should meet specific subcriteria, such as those for clinical evidence and performance gap, although there may be potential exceptions. The TEP agreed that two additional subcriteria are needed to evaluate composite performance measures; these are incorporated into the evaluation criteria in Table 1 (see 1d and 2d) and discussed below. The NQF measure submission form will be modified to accommodate the composite-specific subcriteria.

The TEP recommended that NQF Steering Committees include members with the experience and expertise to evaluate composite performance measures against these criteria. The TEP also noted that composite methodology is still an evolving area of performance measurement that is varied and complex. Both the guidance for the identification of composite performance measures for purposes of NQF endorsement and the recommended evaluation criteria will need to be reviewed again in the future to determine whether further refinements are needed.

Importance to Measure and Report

Evidence

It is important to note the difference between the NQF criteria for evidence and validity. The evidence subcriterion is included under the Importance to Measure and Report criterion and addresses the empirical clinical evidence linking processes to desired health outcomes. In contrast, the validity subcriterion is included under the Scientific Acceptability of Measure Properties criterion and addresses whether the performance measure as constructed is an appropriate reflection of quality. The clinical evidence provides a justification for measurement and a foundation for validity, but the actual performance measure should be empirically tested to demonstrate validity, because how a measure is constructed can affect whether it is an appropriate reflection of quality.

Each component measure must meet the evidence subcriterion to justify its inclusion in the composite. As with individual performance measures, the evidence requirement ensures that efforts for measurement are devoted to health outcomes or processes of care that will influence desired outcomes. If a component measure is NQF-endorsed (since the updated evidence requirements were implemented), it could be considered as meeting the evidence subcriterion. If any component measure does not meet the evidence subcriterion, or does not qualify for an exception to the evidence subcriterion, then the composite would not meet the criterion for Importance to Measure and Report unless those components were removed.

Evidence is required for each component measure, regardless of the approach to constructing a composite measure (i.e., all-or-none, any-or-none, or combining scores from individual performance measures). The evidence may be for a group of interventions included in a composite performance measure, rather than for each one separately (if that is how they were studied). For example, in studies that include multiple interventions delivered to all subjects in the treatment group, the effect of each intervention on outcomes cannot be disaggregated.

Performance Gap

As with individual performance measures, effort for measurement should be directed to where there is variation or overall poor performance. Therefore, the composite performance measure as a whole should demonstrate a performance gap. Each component measure also should generally meet the performance gap criterion to justify its inclusion in the composite. However, the TEP acknowledged there may be circumstances when a component measure that does not meet the performance gap criterion could be included in a composite. In such cases, justification for including such a component would be required (e.g., it contributes to the reliability of the overall composite score or is needed for face validity).
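As a simple illustration of the kind of descriptive evidence a performance gap argument draws on, the sketch below summarizes the spread and overall level of composite scores across a set of hypothetical accountable entities; the scores are invented for illustration only.

```python
# Hypothetical composite scores for ten accountable entities.
from statistics import mean, quantiles

entity_scores = [0.55, 0.62, 0.70, 0.74, 0.78, 0.81, 0.86, 0.90, 0.93, 0.97]

q1, median, q3 = quantiles(entity_scores, n=4)
print("mean:", round(mean(entity_scores), 2),
      "| median:", round(median, 2),
      "| interquartile range:", round(q3 - q1, 2),
      "| range:", (min(entity_scores), max(entity_scores)))
```

Wide variation or a low overall level is the gap the criterion asks for; near-uniformly high scores would require the kind of conceptual or analytic justification described above.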
Quality Construct and Rationale of a Composite Performance Measure

A subcriterion specific to composite performance measures is included under Importance to Measure and Report (see 1d in Table 1). This subcriterion is consistent with and refines the prior guidance regarding a description of the purpose and quality construct for the composite performance measure. Quality of care is an abstract concept that is measured using observed variables. Composite measures are complex, multidimensional, and represent a higher-order construct than individual measures. The composite performance measure quality construct is a concept of quality that includes:

- the overall area of quality (e.g., quality of CABG surgery);
- the included component measures (e.g., pre-operative beta blockade; CABG using internal mammary artery; CABG risk-adjusted operative mortality);
- the conceptual relationships between each component and the overall composite (e.g., components cause or define quality, components are caused by or reflect quality); and
- the relationships among the component measures (e.g., whether they are correlated or not, processes that are expected to lead to better outcomes).

The TEP agreed that the rationale for constructing a composite performance measure, including how the composite provides a distinctive or additive value over the component measures individually, should be described.

The TEP acknowledged that NQF endorses performance measures intended for both accountability and performance improvement and does not endorse measures for a specific accountability application (e.g., payment vs. public reporting). However, the TEP discussed that at times the decisionmaking context could influence the composite measure construction (i.e., which component measures are included or the aggregation and weighting rules).[c] The decisionmaking context also could influence whether a composite measure is more useful than individual performance measures.[d] Additionally, multiple composite measures for the same or similar quality construct, even if addressing different decisionmaking motivations, will trigger an evaluation of competing measures, and the rationales may be an important aspect of determining whether multiple endorsed composite performance measures are justified.

Some TEP members thought the decisionmaking context is a unique aspect of composite performance measures in that the appropriate aggregation and weighting of component measures may vary according to the intended use of the composite. Other TEP members expressed concern that identifying a specific decisionmaking context might be viewed as inconsistent with NQF's current policy to endorse measures suitable for both accountability and performance improvement, rather than for a specific accountability application. Some noted that all composites should be a reflection of quality and therefore the decisionmaking process should be irrelevant. However, the rationale for the composite could include the intended decisionmaking context (e.g., to select a provider for surgery, to create payment incentives to direct resources for improvement), if that context is relevant to explaining how the composite is constructed.

Justification for the inclusion of particular component measures and the approach to composite measure construction and analysis stems from the quality construct and rationale. Therefore, the quality construct and rationale should be clearly articulated and logical in order to meet this subcriterion (1d). Importance to Measure and Report is a must-pass criterion, and composite performance measures must meet all subcriteria, including 1d.

[c] For example, hospital performance on two related sets of measures (A and B) may be important to patients, but failure on group A measures may entail additional costs to the hospital (e.g., longer mean LOS for Medicare fee-for-service patients) whereas failure on group B measures may not entail such additional costs. Composites intended to inform patient choice should include both sets of measures, whereas a pay-for-performance program might use a composite limited to B measures, because the hospital already has a financial incentive to improve on A measures, and therefore the financial reward should be targeted to stimulate improvement on B measures.

[d] For example, a composite performance measure that includes multiple surgical mortality measures may be useful for assessing overall surgical quality, whereas the individual performance measures are more useful for selecting a hospital for a specific surgical procedure.

Scientific Acceptability of Measure Properties

NQF's criteria for endorsement currently allow for testing reliability and validity for the data elements or the performance measure score, or ideally for both. The TEP recommended that composite performance measures should be tested at the level of the composite measure score, as discussed below. Some TEP members also suggested that testing at the score level should be required for all individual performance measures as well as composite performance measures.

Reliability

Reliability is related to the probability of misclassification and is therefore a key issue in performance measurement. One cited advantage of composite performance measures is that using multiple indicators (components) increases reliability (i.e., the ability to detect a provider effect).5 Individual performance measures that reflect the quality of care provided to patients by providers or institutions, when combined into a composite performance measure, should be useful in detecting a consistent pattern of practice or quality of care across patients of the provider or institution. That is, a composite performance measure is a set of measures that, taken together, are thought to reflect the quality of care, show a more consistent pattern within a provider's practice or within an institution than individual measures alone, and reveal greater differences between providers or institutions than would be expected by chance alone.

Reliability testing of the composite performance measure should demonstrate that the composite measure score differentiates signal (i.e., differences in quality) from noise (i.e., random measurement error). Examples of analyses include signal-to-noise analysis,12 interunit reliability,13 and the intraclass correlation coefficient (ICC).14 Note that combining multiple indicators into a composite with all-or-none aggregation rules may not increase reliability of the performance measure score, because the multiple indicators are essentially reduced to one data point on each patient.[e] Therefore, all-or-none composite performance measures should demonstrate reliability of the composite measure score just as is recommended for any composite performance measure.

It is not essential that the individual component measure scores are reliable as long as the composite score itself is reliable. In some cases, an individual performance measure may not provide a reliable signal because of small volume or rare events. However, that measure could appropriately be used as a component in a reliable composite performance measure.

[e] Reliability of the performance measure is different than the rationale that all-or-none measures are intended to foster a system perspective of care, sometimes called system reliability.
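As a rough illustration of the signal-versus-noise idea (not a prescribed NQF analysis), the sketch below estimates, from hypothetical patient-level all-or-none results, how much of the variation in entity composite scores reflects real differences between entities rather than sampling error. Actual testing would more often use hierarchical models, interunit reliability, or an ICC, as cited above.

```python
# Hypothetical data; crude method-of-moments version of a signal-to-noise reliability check.
from statistics import mean, pvariance

# Patient-level composite results (1 = all components met) for three entities.
entities = {
    "A": [1, 1, 0, 1, 1, 1, 0, 1],
    "B": [0, 1, 0, 0, 1, 0, 0, 1],
    "C": [1, 1, 1, 1, 0, 1, 1, 1],
}

scores = {name: mean(results) for name, results in entities.items()}

# Noise: average sampling variance of each entity's observed score.
noise = mean(s * (1 - s) / len(entities[name]) for name, s in scores.items())

# Signal: variance of observed scores beyond what sampling error alone would produce.
signal = max(pvariance(list(scores.values())) - noise, 0.0)

reliability = signal / (signal + noise) if (signal + noise) > 0 else 0.0
print({k: round(v, 2) for k, v in scores.items()}, "reliability:", round(reliability, 2))
```

A value near 1 suggests the composite score separates entities well; a value near 0 suggests the observed differences are mostly chance, which is the misclassification risk the text warns about.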

Validity

Validity testing is directed toward the inferences that can be made about accountable entities on the basis of their performance measure scores. For purposes of endorsing composite performance measures, validity testing of the constructed composite performance measure score is more important than validity testing of the component measures. Even if the individual component measures are valid, the aggregation and weighting rules for constructing the composite could result in a score that is not a true reflection of quality. However, TEP members noted that requiring empirical validity testing of the composite as a whole could be difficult to accomplish prior to NQF endorsement. Hence, if validity of the composite performance measure is not empirically demonstrated, then acceptable alternatives at the time of initial endorsement include systematic assessment of content or face validity of the composite performance measure, or demonstration that each of the individual component measures meets the NQF subcriterion for validity. Empirical validity testing of the overall composite measure would be expected by the time of endorsement maintenance.

It may be unlikely that another valid measure ("gold standard") of the same quality construct (i.e., a criterion measure) will be available to test the criterion validity of a composite performance measure. Therefore, validity testing of composite performance measures is likely to focus on testing various theoretical relationships. For example, a composite measure that includes multiple process measures could be tested for its association with a measure of the outcome that those processes are intended to improve. Alternatively, a composite measure might be tested for its ability to predict future outcomes. Another approach is to test the ability of the composite performance measure to differentiate performance between groups known to differ on a similar or related quality construct.
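Below is a minimal sketch of the first approach mentioned above: testing whether an entity-level process composite is associated with the outcome those processes are meant to improve. The composite scores and readmission rates are invented for illustration, and a real analysis would also account for sample sizes and case mix.

```python
# Hypothetical construct-validity check at the entity level.
from math import sqrt
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

composite_process_score = [0.92, 0.78, 0.85, 0.60, 0.71]   # one value per entity
readmission_rate = [0.10, 0.16, 0.12, 0.22, 0.18]          # outcome the processes target

# A clearly negative correlation is consistent with the hypothesized quality construct;
# a correlation near zero would call the composite's validity into question.
print(round(pearson(composite_process_score, readmission_rate), 2))
```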
Additional Testing to Support the Construction of the Composite Performance Measure

A subcriterion specific to composite performance measures is included under Scientific Acceptability of Measure Properties (see 2d in Table 1). Although this is listed as a separate criterion to signify that it is specific to composite measures, it is in reality an extension of the reliability and validity subcriteria. For example, aggregation and weighting rules are intended to maximize reliability and discrimination. Missing data rules are intended to minimize bias (which means maximizing validity). The item on missing data is probably relevant to all performance measures, but it is included in this criterion because the scope of this project was limited to composite measures and because missing data problems may be magnified when multiple measures are aggregated.

This criterion is consistent with and refines the prior guidance regarding additional analyses to justify the construction of the composite measure (both component selection and the aggregation and weighting rules). The original wording of the criteria for testing composites was more relevant to composite measures that are based on correlated components. The modified criterion is intended to be neutral in terms of the analyses required. For example, if the quality construct and rationale for summarizing the component measures in a composite are based on their correlation with each other, then analyses based on shared variance (e.g., factor analysis, Cronbach's alpha, item-total correlation, and mean inter-item correlation) are appropriate. In such cases, very high correlations between component measures may suggest that a component is redundant and not necessary, and very low correlations may indicate that the component measures do not reflect a common underlying quality construct. Conversely, if the quality construct and rationale for summarizing the measures in a composite are not based on their correlation with each other, then analyses demonstrating the contribution of each component to the composite (e.g., change in a reliability statistic such as the ICC with and without the component measure; change in validity analyses with and without the component measure; magnitude of the regression coefficient in a multiple regression with the composite score as the dependent variable15), or their clinical justification (e.g., correlation of the individual component measures to a common outcome measure), are indicated. The TEP acknowledged that empirical analyses for composites with uncorrelated component measures are not as well established as those for composites with correlated component measures. However, the TEP agreed that NQF should ask for the analyses but allow for some flexibility if developers can make a strong case as to why empirical analyses were not conducted.

The unit of analysis for which performance measures are calculated is typically the service provider organization (hospital, clinic, health plan, etc.) rather than the individual patient. For such performance measures, correlational analyses such as factor analysis or internal consistency reliability should be calculated at the level of the unit rather than the patient, because the unit scores are what will be reported and acted upon. Correlations at the unit level might be quite different from those at the patient level. For example, in a patient survey, some respondents might tend to give more positive (or more negative) responses across the board, creating positive correlations among items that measure entirely distinct aspects of quality. However, when data are aggregated to the provider level, these patient tendencies average out, revealing correlations among items related at the provider level. As another example, measures of cardiac surgery might include complication rates during CABG surgery, during valve repair surgery, and during valve replacement surgery; since typically any patient undergoes only one of these procedures, the patient-level correlations of these measures are not defined, but correlations at the provider or hospital level are meaningful and could be examined to support the construction of a composite surgical quality measure. However, special statistical methods should be used for estimating such unit-level correlations, especially when the component measures do not have high unit-level reliability.16-18
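The sketch below shows, on hypothetical entity-level component scores, two of the analyses named above: Cronbach's alpha for a composite whose rationale rests on correlated components, and a leave-one-out look at each component's contribution to the spread of an equal-weight composite. The component names and the equal-weight aggregation rule are assumptions for illustration only.

```python
# Hypothetical entity-level component scores (lists align across components: one entry per entity).
from statistics import mean, pvariance

components = {
    "preop_beta_blockade": [0.95, 0.80, 0.90, 0.70, 0.85],
    "ima_graft_use":       [0.93, 0.78, 0.88, 0.72, 0.84],
    "risk_adj_survival":   [0.90, 0.75, 0.85, 0.65, 0.80],
}

def equal_weight_composite(cols):
    """Equal-weight composite score per entity (an assumed aggregation rule)."""
    return [mean(row) for row in zip(*cols.values())]

# Cronbach's alpha, computed at the unit (entity) level as the text recommends.
k = len(components)
sum_of_item_vars = sum(pvariance(v) for v in components.values())
total_scores = [sum(row) for row in zip(*components.values())]
alpha = k / (k - 1) * (1 - sum_of_item_vars / pvariance(total_scores))

# Leave-one-out contribution: change in between-entity variance of the composite
# when each component is dropped (one way to show that a component adds information).
base_var = pvariance(equal_weight_composite(components))
for name in components:
    reduced = {c: v for c, v in components.items() if c != name}
    delta = pvariance(equal_weight_composite(reduced)) - base_var
    print(f"{name}: change in composite variance if dropped = {delta:+.5f}")

print("Cronbach's alpha:", round(alpha, 2))
```

For composites whose components are not expected to correlate, the leave-one-out pattern (or analogous changes in a reliability or validity statistic) is the more relevant of the two analyses.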

Although the primary purpose of this subcriterion is to justify the composite construction, parsimony and simplicity are noted as related objectives. Parsimony in regard to the number of component measures that are included, and simplicity in regard to the aggregation and weighting rules, help minimize burden (data collection, confusion, etc.). However, these related objectives are less important and should not compromise the conceptual integrity or reliability and validity of the composite measure.

Scientific Acceptability of Measure Properties is a must-pass criterion, and measures must meet both reliability and validity. In addition, composite measures must meet the additional subcriterion for the composite performance measure in order to meet the must-pass criterion of Scientific Acceptability.

Feasibility

The standard feasibility criterion applies to the composite measure as a whole but must take into account all of the component measures. That is, feasibility of the composite measure will be influenced by the least feasible of the component measures.

Usability and Use

Composite performance measures must meet the updated criterion for Usability and Use. The TEP noted that disaggregation of a composite measure is not an absolute requirement because the individual component measures need not be independently reliable. However, at a minimum, the components of the composite performance measure must be identified with the use of the composite measure. Optimally, for purposes of improvement, the data should be collected and subsequently available to stakeholders at a granular level to facilitate investigation of the individual components.

Comparison to Related and Competing Measures

Composite performance measures are subject to comparison to related and competing measures. If the component measures are not NQF-endorsed, they must be harmonized with endorsed measures or assessed against competing measures.

Table 1. NQF Measure Evaluation Criteria and Guidance for Evaluating Composite Performance Measures
(Columns: Measure Evaluation Criteria | Conditions | Guidance for Composite Performance Measures)

Measure Evaluation Criteria

1. Evidence, Performance Gap, and Priority (Importance to Measure and Report): Extent to which the specific measure focus is evidence-based, important to making significant gains in healthcare quality, and improving health outcomes for a specific high-priority aspect of healthcare where there is variation in or overall less-than-optimal performance. Measures must be judged to meet all subcriteria to pass this criterion and be evaluated against the remaining criteria.

1a. Evidence to Support the Measure Focus. The measure focus is evidence-based, demonstrated as follows:
- Health outcome:3 a rationale supports the relationship of the health outcome to processes or structures of care.
- Intermediate clinical outcome: a systematic assessment and grading of the quantity, quality, and consistency of the body of evidence4 that the measured intermediate clinical outcome leads to a desired health outcome.
- Process:5 a systematic assessment and grading of the quantity, quality, and consistency of the body of evidence4 that the measured process leads to a desired health outcome.
- Structure: a systematic assessment and grading of the quantity, quality, and consistency of the body of evidence4 that the measured structure leads to a desired health outcome.
- Experience with care: evidence that the measured aspects of care are those valued by patients and for which the patient is the best and/or only source of information, OR that patient experience with care is correlated with desired outcomes.
- Efficiency:6 evidence not required for the resource use component.

AND

1b. Performance Gap. Demonstration of quality problems and opportunity for improvement, i.e., data7 demonstrating considerable variation, or overall less-than-optimal

Guidance for Composite Performance Measures

The evidence subcriterion (1a) must be met for each component of the composite (unless NQF-endorsed under the current evidence requirements). The evidence could be for a group of interventions included in a composite performance measure (e.g., studies in which multiple interventions are delivered to all subjects and the effect on the outcomes is attributed to the group of interventions).

The performance gap criterion (1b) must be met for the composite performance measure as a whole. The performance gap for each component also should be demonstrated. However, if a component measure has little opportunity for improvement, justification for why it should be