Incorporating the Sampling Variability from an Employee Perception Survey into the Ranking Process of U.S. Government Agencies


Taylor Lewis¹
U.S. Office of Personnel Management

¹ The opinions, findings, and conclusions expressed in this paper are those of the author and do not necessarily reflect those of the U.S. Office of Personnel Management.

Abstract

The Federal Employee Viewpoint Survey is administered yearly to U.S. Federal employees by the U.S. Office of Personnel Management to gauge a variety of factors pertaining to the workplace environment and employee satisfaction. Responses to four thematically-grouped subsets of attitudinal questions are combined to form indices upon which 37 agencies are ranked. The four themes are motivated by the Human Capital Assessment and Accountability Framework put forth by the Chief Human Capital Officers Act of 2002. Currently, there is no formal procedure to incorporate the sampling error inherent in the indices when assigning ranks. Instead, an ad-hoc rounding technique has historically been used. This paper proposes two methods to approximate index variability, one based on Taylor series linearization and one based on replication. Two alternative rank assignment procedures utilizing these measures of index variability are introduced. No major differences were found between the two variance approximation techniques, but evidence from the alternative rank assignment procedures suggests the current method may be imposing rank boundaries where, statistically speaking, they do not belong.

Key Words: ranking, linearization, replicate weights, Federal Employee Viewpoint Survey, FEVS

1. Introduction

The Federal Employee Viewpoint Survey (FEVS) is administered yearly by the U.S. Office of Personnel Management (OPM) to measure Federal employees' attitudes and perceptions across a variety of dimensions characteristic of a satisfied, engaged, and productive workforce. In FEVS 2011, 560,084 Federal employees spanning more than 80 agencies were asked to participate in a 98-item, primarily Web-based survey. A single-stage, stratified sample design was employed with strata defined by the cross-classification of agency component and supervisory status. Agency component is the first major division or branch within an organization. For instance, while the U.S. Department of Commerce is considered an agency, the Census Bureau and Patent and Trademark Office are two of its components. Supervisory status consisted of three categories: non-supervisors, supervisors, and executives.

At survey close, 266,376 responses had been obtained, corresponding to a response rate of about 48% (AAPOR RR4) (American Association for Public Opinion Research, 2009). To mitigate potential biases stemming from unit nonresponse, various auxiliary variables residing on the sample frame, a centralized personnel database maintained by OPM, were used in a three-stage weighting procedure. Examples of these variables are employee gender, race and ethnicity, and length of service.

Even with the drop-off attributable to nonresponse, the ratio of responding employees to the number of employees in the population was often non-negligible within a stratum. As a result, stratum-specific finite population corrections (FPCs) are incorporated into variance estimates. In sum, FEVS contains three features of complex survey data: stratification, unequal respondent weights, and FPCs.

Most FEVS question response alternatives take the form of a five-point Likert scale, e.g., ranging from Completely Agree to Completely Disagree, sometimes with an explicit Do Not Know (DNK) or No Basis to Judge (NBTJ) option given. No items are mandatory, so there is some degree of item nonresponse, but rates are minor, typically less than 5%. Statistical tests are frequently conducted after dichotomizing an item into either a positive or non-positive response, with a DNK/NBTJ election treated as missing. If we let $y_{ik}$ denote a 0/1 indicator variable of a positive response for the $i$th respondent to the $k$th item and $w_i$ denote the nonresponse-adjusted weight affixed to that respondent, a weighted percent positive estimate is defined as

$$\hat{p}_k = 100 \times \frac{\sum_{i \in R_k} w_i y_{ik}}{\sum_{i \in R_k} w_i}$$

where $R_k$ signifies the set of substantive (non-missing) responses to the $k$th item.

A subset of the 84 non-demographic items is used to form four indices upon which 37 cabinet-level and large, independent agencies are ranked. Motivated by the Human Capital Assessment and Accountability Framework (HCAAF) established in the Chief Human Capital Officers Act of 2002, the indices cover the following four themes: (1) Leadership and Knowledge Management; (2) Results-Oriented Performance Culture; (3) Talent Management; and (4) Job Satisfaction. Each agency's index is computed as the arithmetic mean of a set of weighted percent positive estimates, or

$$\hat{I} = \frac{1}{K} \sum_{k=1}^{K} \hat{p}_k$$

where $K$ is the number of items comprising the index. As an example, the Talent Management index is the simple average of the weighted percent positives of $K = 7$ items: 1, 11, 18, 21, 29, 47, and 68.

The current rank assignment procedure begins by computing each agency's index and rounding to the nearest whole number. The 37 rounded indices are then ranked by ordering from highest to lowest. Ties, which occur when two or more agencies share the same rounded index, are resolved by assigning all agencies involved the highest available rank.
For instance, if two agencies share the highest rounded Talent Management index, both are assigned the first rank and the next highest-scoring agency could be, at best, ranked third. Depending upon how two agencies' indices round, a rank boundary can appear anywhere between a 0 and 1 percentage point difference.

An obvious criticism is that no attempt is made to incorporate the sampling error underlying the indices. For example, any index difference larger than 1 is necessarily declared significant, regardless of its estimated variability. The next section details two methods proposed to quantify index variability. Also discussed are two alternative rank assignment procedures incorporating this variability. The first is based on a sequence of statistical hypothesis tests, whereas the second draws upon ideas of bootstrapping (Efron and Tibshirani, 1993) suggested by Barker et al. (2005). Results are given in Section 3, and Section 4 concludes with a brief discussion.
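The current procedure described above (weighted percent positive, index as an average of items, rounding, and ties sharing the highest available rank) can be sketched in a few lines of Python. This is an illustrative sketch only, not OPM's production code: the function names are hypothetical, missing responses are assumed to be coded as NaN, and "rounding to the nearest whole number" is assumed to mean round-half-up.

```python
import numpy as np

def weighted_percent_positive(y, w):
    """Weighted percent positive for one item.

    y: 0/1 responses, with NaN marking item nonresponse or DNK/NBTJ (assumed coding).
    w: nonresponse-adjusted weights, same length as y.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    answered = ~np.isnan(y)  # the set R_k of substantive responses
    return 100.0 * np.sum(w[answered] * y[answered]) / np.sum(w[answered])

def index_score(items, w):
    """Index = arithmetic mean of the items' weighted percent positives."""
    return float(np.mean([weighted_percent_positive(y, w) for y in items]))

def current_method_ranks(indices):
    """Round each index, then rank from highest to lowest.

    Tied agencies share the highest available rank (1, 1, 3, ...).
    Round-half-up is an assumption; the paper only says "nearest whole number".
    """
    rounded = np.floor(np.asarray(indices, dtype=float) + 0.5)
    # rank = 1 + number of agencies with a strictly larger rounded index
    return [int(1 + np.sum(rounded > r)) for r in rounded]

# Four hypothetical agency indices: 71.6 and 72.4 both round to 72 and tie.
print(current_method_ranks([71.6, 72.4, 70.2, 65.0]))  # [1, 1, 3, 4]
```

Note that 71.6 and 72.4 share a rank despite a 0.8-point gap, whereas 72.4 and 72.6 would be separated by a rank boundary despite a 0.2-point gap, which is the behavior the paper criticizes.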

2. Methods

2.1 Variance Approximation

The first step toward incorporating the sampling variability of an agency's index is to derive a valid design-based estimate of it. One commonly used tool for complex survey data inferences is Taylor series linearization.

2.1.1 Taylor Series Linearization

For quantities expressible as a function of $T$ estimated totals, Woodruff (1971) demonstrated a technique to greatly simplify the Taylor series calculations. To illustrate, consider a single weighted percent positive³ estimate

$$\hat{p}_k = \frac{\sum_{i \in R_k} w_i y_{ik}}{\sum_{i \in R_k} w_i} = \frac{\hat{T}_1}{\hat{T}_2}$$

which is a function of $T = 2$ totals. Woodruff showed that by a mere exchange of summation terms, $\mathrm{var}(\hat{p}_k)$ could be approximated by first constructing a new variable

$$u_i = \sum_{t=1}^{T=2} \frac{\partial \hat{p}_k}{\partial \hat{T}_t} \, \hat{t}_{it}$$

where $\hat{t}_{it}$ denotes the $i$th primary sampling unit's (PSU) estimated total for the $t$th argument in the function, then finding $\mathrm{var}\left(\sum_i u_i\right)$ with respect to the sample design. In other words, one creates a PSU-level variate equaling the sum of the function's partial derivatives with respect to each total times the PSU-level estimate of that particular total. The variance of the sum of this variate serves as the estimated variance of the function. The key element of simplicity is that one bypasses the need to compute $T^2$ covariances. After a little algebra, it can be shown

$$u_{ik} = \frac{1}{\sum_{j \in R_k} w_j} \left( w_i y_{ik} - \frac{\sum_{j \in R_k} w_j y_{jk}}{\sum_{j \in R_k} w_j} \, w_i \right) = \frac{1}{\sum_{j \in R_k} w_j} \left( w_i y_{ik} - \hat{p}_k w_i \right)$$

for the $k$th weighted percent positive estimate. Since the indices are simply an average of $K$ weighted percent positive estimates, the linearized variate is the following generalization of the process above:

$$u_i = \frac{1}{K} \left[ \frac{1}{\sum_{j \in R_1} w_j} \left( w_i y_{i1} - \hat{p}_1 w_i \right) + \cdots + \frac{1}{\sum_{j \in R_K} w_j} \left( w_i y_{iK} - \hat{p}_K w_i \right) \right]$$

³ Without loss of generality, we simplified notation by dropping the × 100 term, so this is technically a proportion as opposed to a percent.

2.1.2 Jackknife Repeated Replication

Replication techniques operate under an alternative paradigm to approximate variances. Although there are various forms, most are similar in spirit in that they involve drawing a sequence of samples from the analysis data set. In the present research project, jackknife repeated replication (JRR) (see Section 6 of Rust, 1985) was pursued. The idea behind the delete-a-group JRR technique is to calculate a series of estimates for a particular quantity by systematically dropping each PSU, in turn, and weighting up the remaining PSUs. Operationally, this can be accomplished by appending a series of replicate weights to the analysis file. The number of replicate weights and, thus, the number of jackknife replicate estimates, equals the number of PSUs. A nice feature of the technique is that the variance approximation formula is a straightforward function of the replicate estimates' variability, regardless of the quantity.

Given that the FEVS employs a single-stage, stratified sample design, each of the 266,376 respondents comprises his or her own distinct PSU. Therefore, the pure jackknife procedure would result in as many replicate weights. To circumvent this cumbersome situation, various methods of collapsing strata, grouping PSUs, or a combination of the two are used in practice (cf. Appendix D of Westat, 2007). Inferences based on the simplified structure are still unbiased; they simply carry fewer degrees of freedom. To this end, the 266,376 respondents were grouped into 84 sampling error computation units (SECUs) spread over 10 sampling error computation strata (SESTRATs), using the terminology of Heeringa, West, and Berglund (2010). SECUs were formed by grouping individuals within the same stratum or a stratum characterized by a similar ratio of respondents to population units. As will become evident below, this allowed for a replicate-specific, pseudo-finite population correction (pseudo-FPC) to be incorporated.
Each of the 84 replicate weights was created by assigning either a value of (1) zero, for all observations within the omitted SECU; (2) the full-sample weight times $n_h / (n_h - 1)$, where $n_h$ equals the number of SECUs in the omitted SECU's SESTRAT, for observations in other SECUs within the same SESTRAT; or (3) the original full-sample weight, for observations in SECUs outside the omitted SECU's SESTRAT. Each replicate weight was fully nonresponse-adjusted by the same three-step process used to create the full-sample weight. In all, 85 weights were produced (one full-sample weight and 84 replicate weights) and an index estimate was calculated using each. If we denote the full-sample-weighted estimate for a particular index as $\hat{I}$ and a replicate-weighted estimate by $\hat{I}_r$ ($r = 1, \ldots, 84$), the JRR variance estimate was calculated as

$$\mathrm{var}_{JRR}(\hat{I}) = \sum_{r=1}^{84} \frac{n_h - 1}{n_h} \times (\text{pseudo-FPC}) \times (\hat{I}_r - \hat{I})^2$$

2.2 Rank Assignment Procedures

2.2.1 Sequential Hypothesis Testing

As before, the sequential hypothesis testing approach began by ordering the 37 agencies' index scores from highest to lowest. Rank boundaries, however, were imposed based on a more formal statistical significance test on the difference between two adjacent estimates. Given sampling was conducted independently within agency, all estimates were assumed independent from one another. Significance was thereby ascertained by comparing the index difference divided by the square root of the sum of the respective agency-specific variance estimates to a reference normal distribution; the degrees of freedom were assumed suitably large to use the z approximation to the Student t distribution. If the quotient was greater than 1.96, the two indices were deemed significantly different and a rank boundary was imposed between them; otherwise, the two agencies shared the highest available rank.

One potential consequence of the pairwise testing of adjacently-ordered estimates is that a sequence of tests could all prove insignificant, even though a higher-scoring agency within the sequence is actually significantly different from a lower-scoring agency. This scenario is depicted by Figure 1 below. One can observe that Agency 1's index is not significantly greater than Agency 2's, nor is Agency 2's significantly greater than Agency 3's, but Agency 1's is significantly greater than Agency 3's. When this occurs, the procedure assigns the first two agencies the highest available rank and imposes a rank boundary between Agency 2 and Agency 3. Agency 3 would then be compared to Agency 4 to determine whether they share a rank or whether another rank boundary should be imposed between them. A nice property of this modification to the algorithm is that it greatly reduces the likelihood that two agencies sharing a rank are statistically distinguishable from one another with respect to the given index.

Figure 1: Schematic Representation of Sequential Hypothesis Testing Rank Assignment Procedure.

For each of the four HCAAF indices, the sequential hypothesis testing approach was implemented twice, once for each variance approximation method.

2.2.2 Parametric Bootstrap

Barker et al. (2005) discuss an interesting approach to ranking the 50 U.S. states and the District of Columbia in terms of immunization coverage rates estimated from the National Immunization Survey.
They propose a parametric bootstrap (Efron and Tibshirani, 1993) technique to assess uncertainty in assigning ranks. If we define $\hat{\theta}_d$ to be the survey estimate for the $d$th domain (e.g., a particular U.S. state) out of a total of $D$ independent domains and $se(\hat{\theta}_d)$ its standard error, the $b$th bootstrap estimate is defined as

$$\hat{\theta}_d^{\,b} = \hat{\theta}_d + z_d^{\,b} \times se(\hat{\theta}_d)$$

where $z_d^{\,b}$ is a random standard normal deviate. The idea is to repeat the procedure many times, independently for all $D$ domains, each time ranking the $\hat{\theta}_d^{\,b}$ values. A rank confidence interval can be ascertained from the distribution of ranks across all $B$ bootstrap samples. For instance, the endpoints of a 90% confidence interval for the $d$th domain would be defined by the 5th and 95th percentiles from its respective distribution of ranks.
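The parametric bootstrap just described can be sketched as follows. This is a minimal illustration assuming only point estimates and standard errors are available; the function name, the fixed seed, and the toy inputs are hypothetical and are not taken from Barker et al. (2005) or from FEVS.

```python
import numpy as np

def bootstrap_rank_ranges(est, se, B=10_000, level=0.90, seed=0):
    """Rank ranges from a parametric bootstrap.

    est, se: length-D arrays of domain estimates and their standard errors.
    Returns (lo, hi): the lower and upper percentile ranks for each domain,
    where rank 1 denotes the highest estimate.
    """
    est = np.asarray(est, dtype=float)
    se = np.asarray(se, dtype=float)
    rng = np.random.default_rng(seed)

    # theta_d^b = theta_d + z_d^b * se(theta_d), independently for each domain
    z = rng.standard_normal((B, est.size))
    draws = est + z * se

    # Rank each bootstrap sample from highest (rank 1) to lowest (rank D)
    order = np.argsort(-draws, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(B)[:, None]
    ranks[rows, order] = np.arange(1, est.size + 1)

    alpha = (1.0 - level) / 2.0
    lo, hi = np.percentile(ranks, [100 * alpha, 100 * (1 - alpha)], axis=0)
    return lo, hi

# Toy example: the leader is far ahead, the two middle domains nearly tie.
lo, hi = bootstrap_rank_ranges([80.0, 60.0, 59.5, 40.0], [1.0, 1.0, 1.0, 1.0])
print(list(zip(lo, hi)))
```

In this toy example the first and last domains never change rank, while the two middle domains each receive a range of 2 to 3, which mirrors the pattern of wider ranges among closely spaced estimates.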

The adaptation to ranking indices calculated from FEVS is straightforward, with the exception that two competing standard error estimates exist. For completeness, the bootstrap procedure was applied twice, once with the standard errors derived via Taylor series linearization and once with those derived via JRR, each time with B = 10,000 bootstrap samples.

3. Results

The first research objective was to compare the two variance approximation methods. Figure 2 consists of four boxplots, each allowing for a visualization of the distribution of ratios of estimated variances (JRR to Taylor series) for a given index across all 37 agencies. Although there is a fair amount of dispersion, the means and medians of the distributions reside near 1, indicating the two techniques are more or less equivalent. If anything, the JRR estimates appear slightly higher: the overall mean ratio, averaged across all four indices, is [...]. The median is [...]. A possible explanation is that the full, multi-stage weighting procedure was performed on all JRR replicates (Valliant, 2004), whereas the Taylor series linearization method implicitly assumes a known, fixed weight. Differences are not large enough or consistent enough to conclude any substantive difference between the two methods. For brevity, hereafter we only consider the variance estimates based on Taylor series linearization.
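To make the comparison of the two variance approximations from Section 2.1 concrete, the sketch below computes both for an index on synthetic data. It is a deliberately simplified illustration and not the FEVS implementation: it assumes a single stratum sampled with replacement (so no FPC or pseudo-FPC), random groups standing in for SECUs, replicate weights that are not re-adjusted for nonresponse, and an index on the proportion scale. All names and the simulated data are hypothetical.

```python
import numpy as np

def index_estimate(Y, w):
    """Mean over items of the weighted proportion positive (NaN = missing)."""
    answered = ~np.isnan(Y)
    Y0 = np.where(answered, Y, 0.0)
    p = (Y0 * w[:, None]).sum(axis=0) / (answered * w[:, None]).sum(axis=0)
    return float(p.mean())

def linearization_variance(Y, w):
    """Taylor series variance with each respondent as a PSU."""
    n = Y.shape[0]
    answered = ~np.isnan(Y)
    Y0 = np.where(answered, Y, 0.0)
    den = (answered * w[:, None]).sum(axis=0)       # sum of weights over R_k
    p = (Y0 * w[:, None]).sum(axis=0) / den         # weighted proportions
    # u_ik = (w_i y_ik - p_k w_i) / sum_{R_k} w, zero if i did not answer item k
    u_ik = answered * w[:, None] * (Y0 - p) / den
    u = u_ik.mean(axis=1)                           # (1/K) * sum over items
    # with-replacement variance of the total of u in one stratum
    return float(n / (n - 1) * np.sum((u - u.mean()) ** 2))

def jackknife_variance(Y, w, groups):
    """Delete-a-group jackknife within a single stratum."""
    labels = np.unique(groups)
    G = labels.size
    full = index_estimate(Y, w)
    reps = np.array([
        # zero weight for the omitted group, G/(G-1) inflation for the rest
        index_estimate(Y, np.where(groups == g, 0.0, w * G / (G - 1)))
        for g in labels
    ])
    return float((G - 1) / G * np.sum((reps - full) ** 2))

# Synthetic data: 2,000 respondents, 3 items, about 3% item nonresponse.
rng = np.random.default_rng(2011)
n, K = 2000, 3
w = rng.uniform(1.0, 3.0, size=n)
Y = (rng.random((n, K)) < 0.6).astype(float)
Y[rng.random((n, K)) < 0.03] = np.nan
groups = rng.permutation(n) % 100                   # 100 random groups of 20

v_lin = linearization_variance(Y, w)
v_jrr = jackknife_variance(Y, w, groups)
print(v_lin, v_jrr, v_jrr / v_lin)
```

Under these assumptions the two estimates should be of similar magnitude, with the jackknife estimate fluctuating around the linearization estimate because it rests on far fewer degrees of freedom.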

Figure 2: Distribution of the Ratio of JRR to Taylor Series Linearization Variance Estimates for each Index across all 37 Agencies.

The next research objective was to compare the current procedure's ranks with ranks resulting from the sequential hypothesis testing procedure. Figure 3 contrasts the two sets of ranks via four scatterplots, one for each index. The x-axis represents the current ranks, whereas the alternatives are represented by the y-axis. Were the two ranks equivalent, we would expect all points to lie, more or less, along a 45-degree line through the origin. This is clearly not precisely the case, although the two sets of ranks are not wildly divergent. Indeed, the median rank change across all four indices relative to the current method is 0. Most jostling occurs around the middle; the more extreme ranks show less variability. In fact, for all four indices, the coveted first rank unequivocally belongs to the same agency, the Nuclear Regulatory Commission.

Figure 3: Scatterplots of Sequential Hypothesis Testing Ranks (Using Taylor Series Linearization) versus the Current Method's Ranks, by Index.

An interesting finding not immediately detectable from Figure 3 is that the current method tends to assign more unique ranks. For example, the average number of unique ranks across the four indices is 17.5 under the current method, whereas the same average is 12 under the sequential hypothesis testing method using either variance approximation method. To understand why this occurs, we might consider a typical index difference required for significance to be declared by first noting the average Taylor series estimate of index variability is [...]. Working backwards from a two-sample significance test assuming independence and a normal approximation, this typical difference would be $1.96 \times \sqrt{2 \times [\ldots]}$. In contrast, recall the current method declares any index difference greater than 1 as significant, and so the comparable typical difference would be even smaller. Taken together, these findings suggest the current method may be imposing rank boundaries where, statistically speaking, they should not appear.
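The sequential hypothesis testing procedure of Section 2.2.1, whose results are discussed above, can be sketched as follows. The sketch reflects one reading of the description: each agency is compared with the highest-scoring agency in the current rank group, and a boundary is imposed as soon as that comparison is significant. The function name and the example values are hypothetical.

```python
import math

def sequential_test_ranks(indices, variances, z_crit=1.96):
    """Assign ranks using sequential z-tests on independent index estimates.

    indices, variances: per-agency index estimates and variance estimates.
    Agencies not significantly below the leader of the current rank group
    share that group's rank (the highest available rank).
    """
    # Agency positions ordered from highest to lowest index
    order = sorted(range(len(indices)), key=lambda a: -indices[a])
    ranks = [0] * len(indices)
    leader = order[0]          # highest-scoring agency of the current group
    current_rank = 1
    for position, agency in enumerate(order, start=1):
        diff = indices[leader] - indices[agency]
        se_diff = math.sqrt(variances[leader] + variances[agency])
        if diff > z_crit * se_diff:
            # Significant difference: impose a rank boundary here
            leader = agency
            current_rank = position
        ranks[agency] = current_rank
    return ranks

# Mirrors the Figure 1 scenario: adjacent differences are not significant,
# but the first and third agencies differ significantly.
print(sequential_test_ranks([70.0, 68.5, 67.2, 60.0], [0.5, 0.5, 0.5, 0.5]))
# [1, 1, 3, 4]
```

With a standard error of 1.0 for every pairwise difference, the 1.5-point and 1.3-point adjacent gaps fall short of 1.96, but the 2.8-point gap between the first and third agencies does not, so the boundary falls between the second and third agencies.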

Next, the focus is shifted to the parametric bootstrap technique and the idea of using percentiles of ranks across B bootstrap samples to establish and report rank ranges. Figure 4 presents these in the form of a stock price high/low/close plot. The high point is the 95th percentile rank for the Talent Management index across all B = 10,000 bootstrap samples, while the low point is the 5th percentile and the close point represents the current method rank.

A few points worthy of discussion can be made from Figure 4. The only agency with zero rank variability is the highest-ranking one, the Nuclear Regulatory Commission, further evidence of its unrivaled index superiority. Though not shown, this also occurs for the other three HCAAF indices. The figure also reinforces the earlier finding that more variability exists within the middle ranks. Barker et al. (2005) made a similar observation (see Figure 1 on p. 609). Lastly, there is a tendency for the current method's rank to appear toward the lower bound of the range. This corresponds to a higher rank, and is expected considering how ties are handled: tied agencies share the highest available rank.

Figure 4: 90% Parametric Bootstrap Rank Ranges (Using Taylor Series Linearization) for Agency Estimates of the Talent Management Index, Overlaid with the Current Method Rank (Denoted by a Triangle). Axis labels: Agency; Rank of Talent Management Index.

4. Discussion

This paper began by deriving two design-based techniques to approximate the variability of four indices estimated using data from FEVS 2011. These indices are used to rank 37 distinct U.S. government agencies across four dimensions specified by the Human Capital Assessment and Accountability Framework. Although the two were comparable, variance approximations based on replication tended to yield slightly larger estimates than those based on Taylor series linearization, which was deemed most likely a byproduct of the full, three-stage weighting procedure being applied independently on all replicates.

Acknowledging that the currently-employed rank assignment procedure makes no use of the underlying sampling variability in the index estimates, this paper then introduced two alternative ranking procedures and presented an empirical evaluation of how they differ from the current method. The first was based on a sequence of statistical significance tests, whereas the second drew upon ideas of bootstrapping to derive a rank range, or rank confidence interval of sorts. Evidence from the sequential testing approach suggests the current method may be differentiating ranks across two or more agencies whose index estimates are actually statistically indistinguishable from one another.

The second alternative procedure was based on the parametric bootstrap approach originally proposed by Barker et al. (2005). While the technique is intuitive, it is anticipated that its adoption may be hindered by interpretative difficulties of lay users (of the ranks), who are less familiar with such sophisticated inferential techniques and have come to expect a single rank reported.

One common observation for either method is that rank uncertainty is greater for agencies situated around the middle of the rank distribution than at either extreme. One notable manifestation of this was how the highest-ranking agency across all four indices, the Nuclear Regulatory Commission, exhibited zero variability in rank ranges derived from the parametric bootstrap method.

The current study is not without limitations. While much attention was given to quantifying sampling error, there could be residual nonresponse or measurement error lingering in index estimates as well. Despite the multiple significance tests that were conducted, no attempt was made to control overall Type I error rate. A future research project could introduce modifications to the sequential hypothesis testing approach to do so.

The two alternatives investigated are certainly not an all-inclusive set of sensible methods to incorporate the underlying sampling error when ranking point estimates derived from surveys.
For example, Stoneberg (2005) proposed a technique to construct rank ranges based on a sequence of comparisons as to whether a given domain estimate's confidence region overlaps with all other domain estimates' confidence regions. As Schenker and Gentleman (2001) note, however, this technique can be overly conservative unless one variance estimate happens to be much larger than the other. Thinking of the problem from a Bayesian perspective might also be helpful in crafting an approach. Lastly, while there are no compelling reasons not to assume the indices would follow a normal distribution, a nonparametric adaptation, such as one of the several proposed by Rao and Wu (1988), might be better suited for sample designs with fewer degrees of freedom or alternative estimators whose distributions are not as well behaved.

References

The American Association for Public Opinion Research. (2009). Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. (6th ed.) AAPOR. Retrieved February 21, 2010: rddefntons2009new.pdf

Barker, L., Smith, P., Gerzoff, R., Luman, E., McCauley, M., and Strine, T. (2005). Ranking States' Immunization Coverage: an Example from the National Immunization Survey, Statistics in Medicine, 24.

Efron, B., and Tibshirani, R. (1993). An Introduction to the Bootstrap. New York, NY: Chapman & Hall.

Heeringa, S., West, B., and Berglund, P. (2010). Applied Survey Data Analysis. Boca Raton, FL: Chapman & Hall/CRC Press.

Rao, J., and Wu, C. (1988). Resampling Inference with Complex Survey Data, Journal of the American Statistical Association, 83.

Rust, K. (1985). Variance Estimation for Complex Estimators in Sample Surveys, Journal of Official Statistics, 1.

Schenker, N., and Gentleman, J. (2001). On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals, The American Statistician, 55.

Stoneberg, B. (2005). Please Don't Use NAEP Scores to Rank Order the 50 States, Practical Assessment, Research, & Evaluation, 10.

Valliant, R. (2004). The Effect of Multiple Weighting Steps on Variance Estimation, Journal of Official Statistics, 20.

Westat. (2007). WesVar 4.3 User's Guide. Retrieved April 9, 2012.

Woodruff, R. (1971). A Simple Method for Approximating the Variance of a Complicated Estimate, Journal of the American Statistical Association, 66.