Examining Turnover in Open Source Software Projects Using Logistic Hierarchical Linear Modeling Approach

Pratyush N Sharma¹, John Hulland², and Sherae Daniel¹

¹ University of Pittsburgh, Joseph M Katz Graduate School of Business, 229 Mervis Hall, Pittsburgh, 15232, USA
{pns9,sld54}@pitt.edu
² University of Georgia, Terry College of Business, 104 Brooks Hall, 310 Herty Drive, Athens, 30602, USA
jhulland@uga.edu

Abstract. Developer turnover in open source software projects is a critical and insufficiently researched problem. Previous research has focused on understanding developers' motivations to contribute, using either the individual developer perspective or the project perspective. In this exploratory study we argue that because developers are embedded in projects, it is imperative to include both perspectives. We analyze turnover in open source software projects by including both individual developer level factors and project specific factors. The Logistic Hierarchical Linear Modeling approach allows us to empirically examine the factors influencing developer turnover and how these factors differ among developers and projects.

Keywords: Open Source Software, Turnover, Logistic Hierarchical Linear Modeling.

1 Introduction

Developer turnover in open source software (OSS) projects is a nontrivial issue because of the frequency with which it occurs and the difficulties new developers face in contributing to a project. Robles and Gonzalez-Barahona [8] analyzed the evolution of some popular OSS projects (such as GIMP and Mozilla) over a period of 7 years and found that these projects suffered from yearly turnover in core development teams and had to rely heavily on regeneration. Turnover is a critical problem in software development projects because it can lead to schedule overruns [1], and regenerating teams is a complicated issue [7].
A majority of OSS research concerns a developer's motivation to contribute to OSS development [2; 3]. Prior studies have tended to explain developer activity levels using either the individual perspective [2; 3] or the project perspective [10]. However, since OSS participants are embedded in projects, it is important to relate the characteristics of individuals to the characteristics of the projects in which they function. Disaggregating all project level variables in an individual level analysis may violate the assumption of independence of observations, since all developers in the same project will have the same value on each of the project variables. On the other hand, aggregating developer level variables to a project level analysis leaves within group information unused [6]. No research study has attempted to model turnover behavior in OSS in a comprehensive fashion, taking into account both developer level and project level factors. In order to address these limitations and expand the existing research, this exploratory study develops a model of turnover behavior in OSS by focusing on two levels: the developer level, which examines factors that may affect developers' decisions to become inactive, and the project level, which examines the factors that may influence the rates of turnover among projects.

2 Methodology

To explore and explain the nature and impact of developer and project variables on turnover, we used archival data. The sample of projects and participants was drawn from SourceForge (www.sourceforge.net). The sample contained data for 40 currently active projects on SourceForge and 201 developers.

2.1 Developer Level Variables

The following five developer level (level 1) variables, including the outcome variable, turnover, were collected.

Turnover. Turnover was operationalized as a binary outcome variable. A developer was deemed active, coded as 0, if he or she made at least one CVS/SVN commit in a 2 month period; otherwise the developer was coded as 1. Joyce and Kraut [4] followed a similar approach in their study of turnover from online newsgroups; however, they chose an observation period of six months to determine turnover.

Role of the Developer. A project may employ developers for various roles that range in the level and kind of expertise required¹. We created two dummy variables, Developer and Admin, with the base group Other (which included all other roles)².

Number of Projects. The number of OSS projects undergoing active development that the developer was involved in.

Past Activity Level. Past activity was operationalized as a binary variable³. A developer was deemed active in the past, coded as 0, if he or she made at least one CVS/SVN commit in the previous 10 month period; otherwise, we coded it as 1.
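The binary coding of Turnover and Past_Activity described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function names, the day counts used to approximate the 2 month and 10 month windows, and the assumption that the 10 month window immediately precedes the 2 month window are all hypothetical.

```python
from datetime import date, timedelta

# Assumption: a "2 month period" is approximated as 61 days and a
# "10 month period" as 304 days; the source gives no exact day counts.
TURNOVER_WINDOW_DAYS = 61
PAST_WINDOW_DAYS = 304


def inactive_flag(commit_dates, window_start, window_end):
    """Return 0 if any commit falls in [window_start, window_end), else 1."""
    has_commit = any(window_start <= d < window_end for d in commit_dates)
    return 0 if has_commit else 1


def code_developer(commit_dates, observation_end):
    """Code Turnover and Past_Activity for one developer (hypothetical helper).

    Assumption: the past-activity window ends where the turnover window begins.
    """
    turnover_start = observation_end - timedelta(days=TURNOVER_WINDOW_DAYS)
    past_start = turnover_start - timedelta(days=PAST_WINDOW_DAYS)
    return {
        "Turnover": inactive_flag(commit_dates, turnover_start, observation_end),
        "Past_Activity": inactive_flag(commit_dates, past_start, turnover_start),
    }


# Hypothetical developer whose only commit predates the turnover window.
example = code_developer([date(2008, 3, 10)], observation_end=date(2008, 12, 31))
print(example)
```

In this made-up case the developer counts as active in the past (0) but inactive in the most recent window (1), which is the pattern the outcome variable is meant to capture.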
Tenure. We approximate the tenure of a developer in months by using the date of joining SourceForge.net.

2.2 Project Level Variables

The following project level (level 2) variables were collected.

Project Age. The date the project was registered is available on SourceForge. We calculate the age as the number of months since its registration on SourceForge.

¹ Some examples of roles developers may perform in a project are administrators, developers, document writers, project managers, packagers, and web designers.
² Roughly 25% of roles belonged to the Other category. Since the Developer and Admin dummies are correlated, we also analyzed the data by merging the Other and Admin categories to create a single Developer dummy variable. In doing so we found that the HLM results did not change appreciably.
³ Using a binary dummy variable for measuring turnover and past activity results in loss of variance information and right censoring of data in developer activity levels. Please see the limitations section for how we intend to remedy this problem in the future.

Size of Project. The number of developers with commit access to the project's CVS/SVN code repository.

2.3 Statistical Models and Results

The Hierarchical Linear Modeling (HLM) technique allows researchers to model developer level outcomes within projects and to model any between project differences that arise. The study was carried out in two parts and follows the approach recommended by Rumberger [9]. In the first part, a developer model of turnover was developed and tested with logistic regression using only developer level variables. This focuses the analysis on developer level variables. However, it not only ignores project level variables but also assumes that the effects of developer level variables on turnover do not vary from project to project. This assumption was tested in the second part of the study using logistic HLM analysis. The developer level model used in the second part was based on the results of the first part. It allowed us to focus the analysis on explaining between project differences in the predicted mean turnover rates (turnover rates adjusted for differences in developer characteristics between projects) and between project differences in the effects of developer level variables on turnover rates.

2.4 Logistic Models

A series of linear logistic models were developed and tested to measure the effect of developer level variables on turnover behavior. Turnover is a binary dependent variable that takes the value one if developer $i$ becomes inactive in the project and zero otherwise; $p_i$ denotes the probability that developer $i$ becomes inactive.
The probability $p_i$ is transformed into the log of odds (or logit), which is expressed as:

\[
\log\left[\frac{p_i}{1 - p_i}\right] = \beta_0 + \beta_1\,\text{Past\_Activity} + \beta_2\,\text{Tenure} + \beta_3\,\text{Developer} + \beta_4\,\text{Admin} + \beta_5\,\text{Number\_of\_Projects}
\]

Table 1 presents the exponentiated logistic coefficients, which represent the ratio of the predicted odds of turnover with a one unit increase in the independent variable to the predicted odds without that increase. Thus, a value of one signifies no change in the odds of turnover. A value greater than (less than) one indicates that the odds of turnover increase (decrease) with a unit change in the independent variable.

Table 1. Predicted odds of turnover

Variable                    Univariate estimates   Multivariate estimates
Past_Activity               371.429**              431.724**
Admin                       .443*                  .989
Developer                   1.104                  .548
Tenure                      1.008                  1.007
Number_of_Projects          .981                   1.038
-2LL (initial = 266.583)                           109.008
Cox and Snell R²                                   .543
Nagelkerke R²                                      .740
Δχ² = 157.57 (p < .001)
*p < 0.05, **p < .001
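To make the reading of exponentiated coefficients concrete, the following sketch converts the multivariate odds ratios from Table 1 back to logit coefficients and to percentage changes in the odds. The helper names are illustrative assumptions; only the numeric values come from the table.

```python
import math

# Exponentiated multivariate coefficients (odds ratios) as reported in Table 1.
odds_ratios = {
    "Past_Activity": 431.724,
    "Admin": 0.989,
    "Developer": 0.548,
    "Tenure": 1.007,
    "Number_of_Projects": 1.038,
}


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Probability corresponding to log odds x."""
    return 1.0 / (1.0 + math.exp(-x))


def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


for name, ratio in odds_ratios.items():
    beta = math.log(ratio)  # coefficient on the logit scale
    change = percent_change_in_odds(ratio)
    print(f"{name}: beta = {beta:.3f}, change in odds = {change:+.1f}%")
```

An odds ratio multiplies the odds, so a ratio of .443 corresponds to odds that are 55.7% lower, and a ratio in the hundreds corresponds to odds several hundred times higher rather than several hundred percent higher.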

The univariate and multivariate estimates of Past_Activity are both significant. The univariate estimate suggests that the odds of turnover for previously inactive developers are about 371 times the odds for developers who were active. Unsurprisingly, inactive developers did not become active at a later stage. The univariate estimate of Admin is also significant and suggests that the odds of turnover for administrators are .443 times (that is, 55.7% lower than) the odds for the Other category. Equivalently, administrators have more than twice the odds of remaining active compared with developers in Other roles. Since Past_Activity and Admin were significant in the univariate estimates, they were retained for further HLM analysis.

2.5 HLM Models

HLM analysis requires two types of models: a level 1 model to estimate the effects of developer level variables on turnover and a level 2 model to estimate the effect of project level variables on the coefficients of the level 1 analysis. We begin the analysis with the unconditional model (base model), which has no predictors at either level.

2.6 Unconditional Model

\[
\log\left[\frac{p_{ij}}{1 - p_{ij}}\right] = \beta_{0j}
\]
\[
\beta_{0j} = \gamma_{00} + u_{0j}
\]

This model allows us to ascertain the variability in the outcome variable at each of the two levels, i.e. within project and between project variability. The results are shown in Table 2.

Table 2. Unconditional Model

Fixed effect                 Coefficient          se     p value
Average project mean, γ00    .484                 .183   0.012

Random effect                Variance component   df     χ²      p value
Project mean, u0j            .314                 39     57.48   .028

Deviance (-2LL)              631.288
Estimated parameters         2

The null hypothesis $H_0: \tau_{00} = 0$ is rejected (p = .028). This suggests that significant variation exists among projects in their turnover rates. The intraclass correlation coefficient (ICC) measures the proportion of variance in the outcome that is between projects [11]. The ICC for our analysis indicates that 8.71% of the variation in turnover resides between projects and can potentially be explained by level 2 predictors.
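The reported ICC can be reproduced from the variance component in Table 2 under one common convention for two-level logistic models. The paper does not state the formula it used, so the latent-variable form below, which fixes the level 1 variance at π²/3, is an assumption; it is shown because it returns the reported 8.71%.

```python
import math


def latent_icc(tau_00):
    """ICC for a two-level logistic model under the latent-variable convention.

    Assumption: the level 1 residual follows a standard logistic distribution,
    whose variance is pi^2 / 3.
    """
    level1_variance = math.pi ** 2 / 3.0
    return tau_00 / (tau_00 + level1_variance)


# Between-project variance component from the unconditional model (Table 2).
icc = latent_icc(0.314)
print(f"ICC = {icc:.4f}")
```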
Further, for a project with a typical turnover rate (with $u_{0j} = 0$), the expected log odds of turnover is .484. This corresponds to a probability of $1/(1 + e^{.484}) = .38$. This means that for a typical developer in a typical project there is a 38% chance of turnover in a 2 month period.

2.7 Conditional Model

This model allows part of the variation in the intercept $\beta_{0j}$ (mean turnover rates) to be explained by project level variables (project age and size):

\[
\log\left[\frac{p_{ij}}{1 - p_{ij}}\right] = \beta_{0j} + \beta_{1j}\,\text{Past\_Activity} + \beta_{2j}\,\text{Admin}
\]
\[
\beta_{0j} = \gamma_{00} + \gamma_{01}\,\text{Proj\_Age} + \gamma_{02}\,\text{Proj\_Size} + u_{0j}
\]
\[
\beta_{1j} = \gamma_{10}
\]
\[
\beta_{2j} = \gamma_{20}
\]

All the variables were grand mean centered to reduce multicollinearity concerns in group level estimation [5]. Table 3 presents the results of the conditional model.

Table 3. Conditional Model

Fixed effect                 Coefficient          se      p value
Average project mean, γ00    1.86                 .49     0.001
Proj_Age slope, γ01          .005                 .006    0.461
Proj_Size slope, γ02         -0.012               0.008   0.154
Past_Activity slope, γ10     5.951                1.161   0.000
Admin slope, γ20             0.041                0.527   0.938

Random effect                Variance component   df      χ²      p value
Project mean, u0j            0.057                37      29.48   > .500

Deviance (-2LL)              477.808
Estimated parameters         6

The null hypothesis $H_0: \tau_{00} = 0$ fails to be rejected (p > .500). This means that after controlling for project size and age, no significant between project variation remains to be explained. The proportional reduction in variance, or variance explained, at level 2 is .8184, implying that project size and age account for 81.84% of the explainable variance at level 2. The deviance (-2 log likelihood) is also significantly improved from the base model ($\Delta D = 153.48$, $\chi^2$ with df = 4, $p < .001$), suggesting a good model fit and a fully identified model⁴.

3 Limitations & Future Directions

Like all empirical work, this study is limited in several ways. First, the sample is biased toward more active projects. Such projects may have well developed infrastructures allowing retention of active members and/or a constant inflow of newer active members. Including less active projects in the future should allow for more robust and generalizable results. Second, the use of binary variables for turnover and past activity leads to loss of variance information and right censoring of the data. To address this critical issue in the future, we will rely on techniques such as survival modeling that allow inference from right censored data.
Finally, we will seek a conceptual integration of developer and project level factors in modeling turnover rather than just an empirical integration.

4 Conclusion

In this preliminary study, we argued that taking both developer and project level factors into account will lead to a richer understanding of the issue of turnover in open source projects. Our analysis suggests that past activity, developer role, project size and project age are important predictors of turnover. We find that there is significant variation in mean turnover rates among projects on SourceForge and that project age and project size account for a sizable proportion of this variation.

⁴ A conditional model that included all developer level variables did not further improve deviance and was rejected in favor of the more parsimonious model presented here.
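The summary figures reported for the two HLM models can be rechecked with a few lines of arithmetic. The sketch below uses hypothetical function names and only the values printed in Tables 2 and 3. It recomputes the proportional reduction in level 2 variance, the deviance difference with its chi-square tail probability (through a closed form that holds only for even degrees of freedom), and the probability for a typical project. On the last point, the expression 1/(1 + e^.484) evaluates to about .38, which is the inverse logit of -.484, whereas the inverse logit of +.484 is about .62; both are printed so that the sign convention is explicit.

```python
import math


def inv_logit(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


# Values as printed in Table 2 (unconditional) and Table 3 (conditional).
tau_unconditional, tau_conditional = 0.314, 0.057
deviance_unconditional, deviance_conditional = 631.288, 477.808
params_unconditional, params_conditional = 2, 6

variance_reduction = (tau_unconditional - tau_conditional) / tau_unconditional
delta_deviance = deviance_unconditional - deviance_conditional
delta_df = params_conditional - params_unconditional
p_value = chi2_upper_tail_even_df(delta_deviance, delta_df)

print(f"variance reduction = {variance_reduction:.4f}")
print(f"delta deviance = {delta_deviance:.2f} on {delta_df} df, p = {p_value:.2e}")
print(f"inv_logit(-0.484) = {inv_logit(-0.484):.3f}")
print(f"inv_logit(+0.484) = {inv_logit(0.484):.3f}")
```

The recomputed variance reduction can differ from the reported .8184 in the last digit because the tabled variance components are rounded to three decimals.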

5 References

1. Collofello, J., Rus, I., Chauhan, A., Smith-Daniels, D., Houston, D., and Sycamore, D.M. 1998. "A System Dynamics Software Process Simulator for Staffing Policies Decision Support," in Proceedings of the Thirty-First Hawaii International Conference on System Sciences, Hawaii.
2. Hars, A., and Ou, S. 2002. "Working for Free? Motivations for Participating in Open Source Projects," International Journal of Electronic Commerce (6:3), pp. 25-39.
3. Hertel, G., Niedner, S., and Herrmann, S. 2003. "Motivation of Software Developers in Open Source Projects: an Internet-Based Survey of Contributors to the Linux Kernel," Research Policy (32), pp. 1159-1177.
4. Joyce, E., and Kraut, R.E. 2006. "Predicting Continued Participation in Newsgroups," Journal of Computer Mediated Communication (11), pp. 723-747.
5. Raudenbush, S.W. 1989. "Centering Predictors in Multilevel Analysis: Choices and Consequences," Multilevel Modeling Newsletter (1), pp. 10-12.
6. Raudenbush, S.W., and Bryk, A.S. 2002. Hierarchical Linear Models (Second Edition), Thousand Oaks: Sage Publications.
7. Reel, J.S. 1999. "Critical Success Factors In Software Projects," IEEE Software (16:3), pp. 18-23.
8. Robles, G., and Gonzalez-Barahona, J.M. 2006. "Contributor Turnover in Libre Software Projects," in IFIP International Federation for Information Processing: Open Source Systems, Springer, Boston.
9. Rumberger, R.W. 1995. "Dropping Out of Middle School: A Multi-Level Analysis of Students and Schools," American Educational Research Journal (32:3), pp. 583-625.
10. Stewart, K.J., Ammeter, T.A., and Maruping, L. 2006. "Impacts of License Choice and Organizational Sponsorship on User Interest and Development Activity in Open Source Software Projects," Information Systems Research (17:2), pp. 126-144.
11. Snijders, T., and Bosker, R. 1999. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, Sage Publications.