
Undergraduate Category: Engineering and Technology Degree Level: Computer Engineering Abstract ID# 485

Active Files as a Measure of Software Maintainability Lukas Schulte Northeastern University

Hitesh Sajnani University of California, Irvine

Jacek Czerwonka Microsoft Corp.

Results: Active Files Exist: (2-8% of system files)

Algorithmic Methodology:

Abstract: In this paper, we explore the set of source files that are changed unusually often. We define these files as active files. Although discovering active files relies only on version history and defect classification, the simple concept of active files can deliver key insights into software development activities. Active files can help focus code reviews, target testing, reveal areas of potential merge conflicts, and identify areas that are central to program comprehension. In an empirical study of six large software systems within Microsoft, ranging from products to services, we found that active files constitute only 2-8% of the total system size, contribute 20-40% of system file changes, and are responsible for 60-90% of all defects. Moreover, we establish that the majority (65-95%) of the active files are architectural hub files that change due to feature addition rather than defect fixing.

Introduction:

Active Files

System Growth

Developers

Defining an Active File

Defining a Recurrently Active File:

Active Files Change Often: (20-40% of system file changes)

Defining IsAF(f,…,5): The function IsAF(f, d, t, n) tests whether a file f is active at date d, with activity window t and recurrence n. Here ℂ(d) presumably denotes the set of files changed on date d.

\[
\begin{aligned}
\mathit{IsAF}(f, d, t, 0) &= \mathit{false} \\
\mathit{IsAF}(f, d, t, 1) &= f \in \mathbb{C}(d) \;\wedge\; \exists d'.\; d - t \le d' < d \;\wedge\; f \in \mathbb{C}(d') \\
\mathit{IsAF}(f, d, t, n + 2) &= \mathit{IsAF}(f, d, t, 1) \;\wedge\; \mathit{IsAF}(f, d - t, t, n + 1)
\end{aligned}
\]

In our analyses we use IsAF(f, d, t, 1) to define active files, and IsAF(f, d, t, 5) to define recurrently active files.

Next, we aggregate the set of all active files in a given timeframe. This is defined as AF_M(t, n), where M describes the timeframe, t the activity window, and n the recurrence:

\[
\mathit{AF}_M(t, n) = \bigcup_{d \in \mathit{range}(M)} \{\, f \mid \mathit{IsAF}(f, d, t, n) \,\}
\]
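The recursive definition of IsAF and the aggregate AF_M can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: dates are taken at day granularity, the window t is a whole number of days, range(M) is treated as an inclusive date interval, and the `Changes` mapping (standing in for ℂ) and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Set

# Assumption: stands in for ℂ, mapping a date to the set of files changed on it.
Changes = Dict[date, Set[str]]


def changed(changes: Changes, f: str, d: date) -> bool:
    """True if file f was changed on date d, i.e. f ∈ ℂ(d)."""
    return f in changes.get(d, set())


def is_af(changes: Changes, f: str, d: date, t: int, n: int) -> bool:
    """IsAF(f, d, t, n) with the window t given in days (an assumption)."""
    if n <= 0:
        return False  # IsAF(f, d, t, 0) = false
    # IsAF(f, d, t, 1): changed on d and at least once in [d - t, d).
    active_now = changed(changes, f, d) and any(
        changed(changes, f, d - timedelta(days=k)) for k in range(1, t + 1)
    )
    if n == 1:
        return active_now
    # IsAF(f, d, t, n + 2) = IsAF(f, d, t, 1) ∧ IsAF(f, d - t, t, n + 1)
    return active_now and is_af(changes, f, d - timedelta(days=t), t, n - 1)


def active_files(changes: Changes, start: date, end: date, t: int, n: int) -> Set[str]:
    """AF_M(t, n) for the timeframe M = [start, end], both ends inclusive."""
    result: Set[str] = set()
    d = start
    while d <= end:
        # Only files changed on d can satisfy IsAF at d, so checking ℂ(d) suffices.
        result |= {f for f in changes.get(d, set()) if is_af(changes, f, d, t, n)}
        d += timedelta(days=1)
    return result
```

Calling `active_files(changes, start, end, t, 1)` would give the active set for a timeframe, and passing 5 as the last argument the recurrently active set.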

Changes

Data Collection Methodology: Source code history is the primary data set used in this study. CodeMine monitors all code change submissions in the products' source code repositories, recording the author, time of change, source files involved, size of change, change comments, and any movement of the change between branches into which it was subsequently integrated.

Product              Release Cadence   Timeframe
Exchange             Months            Jun 2010 – Jul 2013
Windows Phone        Months            Dec 2010 – Jul 2013
Office 365           Weeks             Nov 2011 – Jul 2013
Bing Infrastructure  Months            Mar 2006 – Jul 2013
Bing UX Previous     Days              Jun 2007 – Apr 2012
Bing UX Current      Hours             Dec 2011 – Jul 2013
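To make the shape of the input data concrete, the sketch below groups change submissions into per-date sets of changed files, the only input the active-file definition needs. The `ChangeRecord` fields mirror the attributes listed above, but the record type, its field names, and `changes_by_date` are hypothetical and do not represent CodeMine's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical record for one code change submission."""
    author: str
    timestamp: datetime       # time of change
    files: Tuple[str, ...]    # source files involved
    lines_changed: int        # size of change
    comment: str              # change comment
    branch: str               # branch the change was submitted to


def changes_by_date(records: Iterable[ChangeRecord]) -> Dict[date, Set[str]]:
    """Group submissions into the set of files changed on each calendar date."""
    by_date: Dict[date, Set[str]] = defaultdict(set)
    for record in records:
        by_date[record.timestamp.date()].update(record.files)
    return dict(by_date)
```

Grouping by calendar date is a design choice of this sketch; a finer or coarser granularity would work the same way as long as the activity window uses the same unit.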

Active Files are Systemic: (Every system shows a recurrently active file set)


Defects

Fact 1: Active software products continually evolve [1, 2].
Fact 2: To capture system changes, existing methodologies use structural metrics, e.g. Code Complexity [3, 4].
Premise: There may be many thousands of files present in the system but only a small subset of them might be in need of continuous change i.e. remain 'active'.
Idea: Capture system evolution in a behavioral metric.

𝑓 𝐼𝑠𝐴𝐹(𝑓, 𝑑, 𝑑, 𝑛)}

Active Files are Very Failure Prone: (60-90% of system file defects)
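The three headline ratios (share of files, share of changes, share of defects) could be computed along the following lines once an active set is known. This is a sketch under assumptions: system size is measured as a file count, and churn and defects are available as per-file counts. The function and parameter names are hypothetical.

```python
from typing import Dict, Set


def share(part: float, whole: float) -> float:
    """Percentage of whole accounted for by part (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


def active_set_shares(
    all_files: Set[str],
    active: Set[str],
    churn: Dict[str, int],    # assumption: number of changes per file
    defects: Dict[str, int],  # assumption: number of defects attributed per file
) -> Dict[str, float]:
    """Share of files, changes, and defects that fall in the active set."""
    active = active & all_files  # ignore active files outside the system
    return {
        "files": share(len(active), len(all_files)),
        "changes": share(
            sum(churn.get(f, 0) for f in active),
            sum(churn.get(f, 0) for f in all_files),
        ),
        "defects": share(
            sum(defects.get(f, 0) for f in active),
            sum(defects.get(f, 0) for f in all_files),
        ),
    }
```

For a system matching the reported ranges, the result would show a `files` share of a few percent alongside much larger `changes` and `defects` shares.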

[Figure: All Files split into Active Set (2-8%) and Inactive Set; churn split into Churn in Active Set (20-40%) and Churn in Inactive Set; defects split into Buggy files in Active Set (60-85%) and Others.]

Conclusion & Implications:

Code Review and Testing Policies:
• Active files are an elegant way to find hot-spots of activity and focus on them.

Monitoring Architectural Health:
• Active files show us how bad the problems are: the defect density of the active set.
• Active files indicate where the problems are: architecturally key locations.
• Recurrently active files display the systemic activity in the system: which hub files are a constant drain on developer resources.

Managing Merge Conflicts:
• Proactive notifications of pending changes occurring on the same active file will help resolve possible conflicts as early as possible.

Program Comprehension:
• In large software systems, the flood of information is an initial barrier to effective development. Focusing on active files likely provides the quickest route to understanding a system's underlying architecture.
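As an illustration of the merge-conflict idea above, the sketch below flags active files touched by more than one pending change, which is the point at which a notification could be sent. The input shape (pairs of a change identifier and the files it touches) and the function name are assumptions made for this example, not part of any tool described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def pending_conflicts(
    pending: Iterable[Tuple[str, Iterable[str]]],  # (change id, files touched)
    active: Set[str],
) -> Dict[str, List[str]]:
    """Map each active file to the pending changes touching it, if two or more do."""
    touching: Dict[str, Set[str]] = defaultdict(set)
    for change_id, files in pending:
        for f in set(files) & active:
            touching[f].add(change_id)
    # Keep only files with competing pending changes; sort ids for stable output.
    return {f: sorted(ids) for f, ids in touching.items() if len(ids) > 1}
```

Restricting the check to the active set keeps the number of notifications small, since by the results above only a few percent of files are active.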

We see no major differences between the active files across the different products, even though these projects span thousands of developers, more than seven years, and hundreds of millions of lines of source code.

References:
[1] Lehman M. M., Belady L. A., Program Evolution: Processes of Software Change, Academic Press, 1985.
[2] Lehman M. M., Ramil J. F., Wernick P., Turski W., Metrics and Laws of Software Evolution – The Nineties View, Proc. International Software Metrics Symposium, 1997.
[3] DeMarco T., Controlling Software Projects: Measurement, Management and Estimation, Prentice Hall/Yourdon Press, 1982.
[4] Chidamber S., Kemerer C., A Metrics Suite for Object Oriented Design, IEEE Transactions on Software Engineering, vol. 20, no. 6, 1994.

Acknowledgements: We thank Tom Zimmermann, Wolfram Schulte, Nachi Nagappan, Michaela Greiler, Brian Bussone, Crista Lopes, and Brendan Murphy for their critical eyes, Varun Singh for his help with CodeMine queries, Danny van Velzen for providing us with domain expertise in the Bing source code repositories, and the product groups for providing us with their data.