Undergraduate Category: Engineering and Technology Degree Level: Computer Engineering Abstract ID# 485
Active Files as a Measure of Software Maintainability Lukas Schulte Northeastern University
Hitesh Sajnani University of California, Irvine
Jacek Czerwonka Microsoft Corp.
Results: Active Files Exist: (2-8% of system files)
Algorithmic Methodology:
Abstract: In this paper, we explore the set of source files that are changed unusually often. We define these files as active files. Although discovering active files relies only on version history and defect classification, the concept delivers key insights into software development activities. Active files can help focus code reviews, target testing, reveal areas prone to merge conflicts, and identify areas that are central to program comprehension. In an empirical study of six large software systems within Microsoft, ranging from products to services, we found that active files constitute only 2-8% of the total system size, contribute 20-40% of system file changes, and are responsible for 60-90% of all defects. We also establish that the majority (65-95%) of active files are architectural hub files that change because of feature additions rather than defect fixes.
Introduction:
Active Files
System Growth
Developers
Defining an Active File
Defining a Recurrently Active File
Active Files Change Often: (20-40% of system file changes)
Defining IsAF(f,…,5): The function $\mathit{IsAF}(f, d, t, n)$ tests whether a file $f$ is active at date $d$, with activity window $t$ and recurrence $n$.

$$
\begin{aligned}
\mathit{IsAF}(f, d, t, 0) &= \mathit{false} \\
\mathit{IsAF}(f, d, t, 1) &= f \in C_d \;\land\; \exists d'.\; d - t \le d' < d \;\land\; f \in C_{d'} \\
\mathit{IsAF}(f, d, t, n + 2) &= \mathit{IsAF}(f, d, t, 1) \;\land\; \mathit{IsAF}(f, d - t, t, n + 1)
\end{aligned}
$$

Here $C_d$ is read as the set of files changed on date $d$. In our analyses we use $\mathit{IsAF}(f, d, t, 1)$ to define active files, and $\mathit{IsAF}(f, d, t, 5)$ to define recurrently active files. Next, we aggregate the set of all active files in a given timeframe. This is defined as $\mathit{AF}_M(t, n)$, where $M$ describes the timeframe, $t$ the activity window, and $n$ the recurrence; it collects the files $f$ for which $\mathit{IsAF}(f, d, t, n)$ holds.
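The recursive definition above can be sketched in Python. This is a minimal illustration, not the implementation used in the study: the `history` mapping, the file names, the choice of days as the unit for the window `t`, and the helper names `is_active` and `active_files` are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical change history: date -> set of files changed on that date.
history = {
    date(2013, 1, 1): {"hub.cs"},
    date(2013, 1, 3): {"hub.cs", "leaf.cs"},
    date(2013, 1, 5): {"hub.cs"},
}

def is_active(changes, f, d, t, n):
    """IsAF(f, d, t, n), with the window t measured in days (an assumption)."""
    if n <= 0:
        return False  # IsAF(f, d, t, 0) = false
    # IsAF(f, d, t, 1): f changed on d and on some d' with d - t <= d' < d.
    changed_on_d = f in changes.get(d, set())
    changed_in_window = any(
        f in changes.get(d - timedelta(days=k), set()) for k in range(1, t + 1)
    )
    base = changed_on_d and changed_in_window
    if n == 1:
        return base
    # IsAF(f, d, t, n + 2) = IsAF(f, d, t, 1) and IsAF(f, d - t, t, n + 1)
    return base and is_active(changes, f, d - timedelta(days=t), t, n - 1)

def active_files(changes, dates, t, n):
    """Files that are active (recurrence n) on at least one of the given dates."""
    all_files = set().union(*changes.values())
    return {f for f in all_files for d in dates if is_active(changes, f, d, t, n)}

timeframe = [date(2013, 1, 1) + timedelta(days=k) for k in range(5)]
print(active_files(history, timeframe, t=2, n=1))
```

With a two-day window, `hub.cs` is active on 3 January and 5 January because each change follows another change within the window, while `leaf.cs` changes only once and never qualifies.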
Changes
Data Collection Methodology: Source code history is the primary data set used in this study. CodeMine monitors all code change submissions in the products' source code repositories, including the author, the time of the change, the source files involved, the size of the change, the change comments, and the branches into which the change was subsequently integrated.

Product               Release Cadence   Timeframe
Exchange              Months            Jun 2010 - Jul 2013
Windows Phone         Months            Dec 2010 - Jul 2013
Office 365            Weeks             Nov 2011 - Jul 2013
Bing Infrastructure   Months            Mar 2006 - Jul 2013
Bing UX Previous      Days              Jun 2007 - Apr 2012
Bing UX Current       Hours             Dec 2011 - Jul 2013
Active Files are Systemic: (Every system shows a recurrently active file set)
$\mathit{Changes}(M)$
Defects
Fact 1: Active software products continually evolve [1, 2].
Fact 2: To capture system changes, existing methodologies use structural metrics, e.g. code complexity [3, 4].
Premise: There may be many thousands of files present in the system, but only a small subset of them might be in need of continuous change, i.e. remain "active".
Idea: Capture system evolution in a behavioral metric.
$\{\, f \mid \mathit{IsAF}(f, d, t, n) \,\}$
Active Files are Very Failure Prone: (60-90% of system file defects)
Conclusion & Implications:
[Figure: Active Set (2-8%) vs. Inactive Set of All Files; Churn in Active Set (20-40%) vs. Churn in Inactive Set; Buggy files in Active Set (60-85%) vs. Others]
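The three proportions reported on this poster (the active set's share of files, of changes, and of defects) could be computed along the following lines. This is a sketch under stated assumptions: the per-file change and defect counts, the file names, and the function `active_set_shares` are hypothetical, and counting defects per file is only one possible reading of how the defect share is attributed.

```python
def active_set_shares(changes_per_file, defects_per_file, active):
    """Return the active set's share of files, changes, and defects as fractions.

    changes_per_file: file -> number of changes (every file in the system appears here)
    defects_per_file: file -> number of defects attributed to that file
    active: set of files classified as active
    """
    total_files = len(changes_per_file)
    total_changes = sum(changes_per_file.values())
    total_defects = sum(defects_per_file.values())

    def share(part, whole):
        # Guard against empty inputs instead of dividing by zero.
        return part / whole if whole else 0.0

    active_changes = sum(c for f, c in changes_per_file.items() if f in active)
    active_defects = sum(n for f, n in defects_per_file.items() if f in active)
    return (
        share(len(active & changes_per_file.keys()), total_files),
        share(active_changes, total_changes),
        share(active_defects, total_defects),
    )

# Hypothetical toy data, far smaller than any system in the study.
changes_per_file = {"hub.cs": 6, "b.cs": 2, "c.cs": 1, "d.cs": 1}
defects_per_file = {"hub.cs": 3, "c.cs": 1}
file_share, change_share, defect_share = active_set_shares(
    changes_per_file, defects_per_file, active={"hub.cs"}
)
print(file_share, change_share, defect_share)
```

The toy numbers are chosen only to make the arithmetic easy to follow; they are not drawn from the study's data.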
Code Review and Testing Policies:
• Active files are an elegant way to find hot-spots of activity and focus on them.
Monitoring Architectural Health:
• Active files show us how bad the problems are, through the defect density of the active set.
• Active files indicate where the problems are: architecturally key locations.
• Recurrently active files display the systemic activity in the system: which hub files are a constant drain on developer resources.
Managing Merge Conflicts:
• Proactive notifications of pending changes occurring on the same active file will help resolve possible conflicts as early as possible.
Program Comprehension:
• Large software systems create an information flood that is an initial barrier to effective development. Focusing on active files likely provides the quickest route to understanding a system's underlying architecture.
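The merge-conflict idea above can be illustrated with a small sketch that flags active files touched by more than one pending change. The poster does not describe a concrete notification mechanism, so the `pending` mapping, the change identifiers, and the function `overlapping_pending_changes` are hypothetical.

```python
from collections import defaultdict

def overlapping_pending_changes(pending, active):
    """Map each active file to the pending changes touching it, if more than one.

    pending: change id -> set of files the pending change modifies
    active: set of files classified as active
    """
    touching = defaultdict(set)
    for change_id, files in pending.items():
        for f in files & active:  # only active files are of interest here
            touching[f].add(change_id)
    # Keep only files with at least two pending changes: candidates for a warning.
    return {f: ids for f, ids in touching.items() if len(ids) > 1}

# Hypothetical pending changes from three developers.
pending = {
    "alice/123": {"hub.cs", "x.cs"},
    "bob/456": {"hub.cs"},
    "carol/789": {"x.cs"},
}
warnings = overlapping_pending_changes(pending, active={"hub.cs"})
print(warnings)
```

Restricting the check to the active set keeps the number of notifications small, since `x.cs` is also touched twice but is not an active file in this example.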
We see no major differences between the active files across the different products even though these projects span thousands of developers, more than seven years, and hundreds of millions of lines of source code.
References:
[1] Lehman M. M., Belady L. A., Program Evolution: Processes of Software Change, Academic Press, 1985
[2] Lehman M. M., Ramil J. F., Wernick P., Turski W., Metrics and Laws of Software Evolution - The Nineties View, Proc. International Software Metrics Symposium, 1997
[3] DeMarco T., Controlling Software Projects: Measurement, Management and Estimation, Prentice Hall/Yourdon Press, 1982
[4] Chidamber S., Kemerer C., A Metrics Suite for Object Oriented Design, IEEE Transactions on Software Engineering, vol. 20, no. 6, 1994
Acknowledgements: We thank Tom Zimmermann, Wolfram Schulte, Nachi Nagappan, Michaela Greiler, Brian Bussone, Crista Lopes, and Brendan Murphy for their critical eyes, Varun Singh for his help with CodeMine queries, Danny van Velzen for providing us with domain expertise in the Bing source code repositories, and the product groups for providing us with their data.