Author Entropy vs. File Size in the GNOME Suite of Applications
Abstract
We present the results of a study in which author
entropy was used to characterize author contributions
per file. Our analysis reveals three patterns: banding in
the data, uneven distribution of data across bands, and
file-size-dependent distributions within bands. Our results suggest that when two authors contribute to a file,
large files are more likely to have a dominant author
than smaller files.
1 Introduction
As software systems evolve and grow they become
more complex [1]. One measure of system complexity
is author entropy, which characterizes software authorship patterns. Author entropy is a summary statistic that quantifies the contributions of authors to files.
Files with dominant authors have low entropy; files
without dominant authors have high entropy.
In this paper we present our investigation of the relationship between author entropy and file size in the
GNOME suite of projects. Specifically, we focus on the
two-author case. This study continues the research initiated
by Taylor et al. [3].
2 Methods
We selected projects from the GNOME suite of desktop
applications in response to the 2009 MSR Data
Mining Challenge. In this study we filtered and calculated
metrics (as described in section 2.1) for all of
the 576 projects in the GNOME suite, as of February
2009. We report results from a visual analysis of ten
of these projects.
2.1 Producing the Data Sample for Visual
Analysis
We manually selected 10 projects for visualization
based on maturity and size. Mature projects are more
likely to display long-term project patterns, and large
projects provide more data for visualization.
Each project was filtered to exclude all non-source-code
files. The filter compared the extension of each
file to 107 known source-code file extensions [2]. To focus
our results on the two-author entropy case, we further
filtered the projects to exclude all files written
by a single author or by more than two authors.
After filtering, we computed two metrics for each
file in each project: file size (measured in lines of code,
LOC) and author entropy (described in section 2.2).
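The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the tooling used in the study: it assumes per-line authorship is already available as a mapping from file path to a list of author names (one per line), the extension set is a small hypothetical stand-in for the 107 extensions in [2], and the helper names `is_source_file`, `two_author_files`, and `file_size_loc` are invented for this example. It also assumes LOC is a plain line count, which the text does not specify.

```python
from pathlib import PurePosixPath

# Hypothetical subset; the actual list of 107 extensions [2] is not reproduced here.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".py", ".js"}


def is_source_file(path):
    """Return True if the file extension is in the known source-code set."""
    return PurePosixPath(path).suffix.lower() in SOURCE_EXTENSIONS


def two_author_files(line_authors_by_file):
    """Keep only source files whose lines were written by exactly two authors."""
    selected = {}
    for path, line_authors in line_authors_by_file.items():
        if not is_source_file(path):
            continue  # drop non-source-code files
        if len(set(line_authors)) != 2:
            continue  # drop single-author files and files with more than two authors
        selected[path] = line_authors
    return selected


def file_size_loc(line_authors):
    """File size as a simple line count (assumed definition of LOC)."""
    return len(line_authors)


# Toy input: one author name per line of each file.
sample = {
    "src/main.c": ["alice", "alice", "bob", "alice"],  # kept: source, two authors
    "src/util.c": ["alice", "alice"],                  # dropped: single author
    "README": ["bob", "carol"],                        # dropped: not a source file
    "src/gui.py": ["alice", "bob", "carol"],           # dropped: three authors
}
kept = two_author_files(sample)
sizes = {path: file_size_loc(lines) for path, lines in kept.items()}
```

In this toy input only `src/main.c` survives both filters, with a size of 4 lines.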
2.2 Author Entropy
We use author entropy to characterize the distribution
of author contributions for each file. In this section
we briefly describe author entropy. For a more detailed
explanation, see [3].
Entropy is a measure of chaos or disorder in a system.¹
The concept of entropy originated in thermodynamics
but has been borrowed by information theory.
For our purposes, we use author entropy as a measure
of the distribution of author contributions within a file.
The general form for entropy, shown in Equation 1,
states that for $c$ authors each author's contribution to
the total entropy is $-p_i \log_2 p_i$, where $p_i$ is the proportion
of lines in the file written by author $i$.

$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (1)$$
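As a minimal sketch of how Equation 1 might be evaluated for a single file, the snippet below computes entropy from a list of per-line author names. It assumes the standard Shannon form with a leading minus sign, so values are non-negative; the function name `author_entropy` and the list-of-names input format are assumptions made for illustration.

```python
import math
from collections import Counter


def author_entropy(line_authors):
    """Shannon entropy (base 2) of the per-author share of lines in one file."""
    total = len(line_authors)
    if total == 0:
        return 0.0  # assumed convention: an empty file has zero entropy
    entropy = 0.0
    for count in Counter(line_authors).values():
        p = count / total  # proportion of lines written by this author
        entropy -= p * math.log2(p)
    return entropy


# Two authors with equal shares: the two-author maximum of 1 bit.
balanced = author_entropy(["alice", "bob"] * 50)

# One dominant author (90% of lines): entropy falls to roughly 0.47.
dominated = author_entropy(["alice"] * 90 + ["bob"] * 10)
```

In the two-author case the value ranges from just above 0 (one author wrote nearly every line) to 1 (an even split), which matches the reading that files with dominant authors have low entropy.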
¹ Low entropy does not necessarily indicate software quality
or "goodness."
Jason R. Casebolt, Jonathan L. Krein, Alexander C. MacLean, Charles D. Knutson
SEQuOIA Lab, Brigham Young University
caseb106@gmail.com, jonathankrein@byu.net, amaclean@byu.net, knutson@cs.byu.edu
Daniel P. Delorey
Google, Inc.
dandelorey@gmail.com
978-1-4244-3493-0/09/$25.00 © 2009 IEEE 91 MSR 2009