background

Niko's Project Corner

JGit blame for fun and profit(?)

Description Analysis of git commits on OOS projects
Languages Clojure
Tags Git
Elas­tic­search
Duration Fall 2017
Modified 2nd April 2018
thumbnail

Soft­ware pro­jects are typ­ically "tracked" on a ver­sion con­trol sys­tem (VCS). Each "ver­sion" of the code is called a "com­mit", which does not only store file con­tents, but also plenty of meta­data. This cre­ates a very rich set of data, and in the age of open source there are thou­sands of pro­jects to study. A few ex­am­ples are Git of The­seus and Gi­ten­tial, but by fo­cus­ing on "git blame" (see who has com­mit­ted each line on each file) I hope to bring some­thing new to the table. In short I have an­alyzed how source code gets re­placed by newer code, track­ing the top­ics of who, when and why, and how old the code was.

Git's com­mit log is a di­rected acyclic graph (DAG), which can be vi­su­al­ized by gitk (an ex­am­ple is shown at Fig­ure 1). A com­mit can have ar­bi­trary many par­ents and chil­dren, but the most typ­ical cases are just one or two. Com­mits from which a new branch has been cre­ated have more than one child, and "merge" com­mits have more than one par­ent. As in real life par­ents are "older" than chil­dren, and the list of a com­mit's par­ents is im­mutable but new chil­dren might (un­ex­pect­edly or not) ap­pear.

gitk
Figure 1: Clo­jure pro­ject's com­mits vi­su­al­ized by gitk.

Git doesn't ac­tu­ally store com­mits as "deltas" from the pre­vi­ous ver­sion, but in­stead refers to com­plete fold­ers and files (for de­tails have a look at Git in­ter­nals). This struc­ture has some in­ter­est­ing prop­er­ties, for ex­am­ple when cal­cu­lat­ing the dif­fer­ence be­tween two com­mits once can skip en­tire sub-di­rec­to­ries by just de­ter­min­ing that tree-ob­jects' hashes are iden­ti­cal. When two files dif­fer, the end user typ­ically wants to ig­nore the iden­ti­cal parts and just fo­cus on what was changed (a "patch"). Cal­cu­lat­ing this rep­re­sen­ta­tion is ac­tu­ally a sur­pris­ingly com­plex task, some al­go­rithms are ex­plained at Stack­Over­flow. An ex­am­ple com­mit vi­su­al­iza­tion on gitk is shown at Fig­ure 2.

example_commit
Figure 2: An ex­am­ple com­mit from Clo­jure's repos­itory (com­mit 9b80a552fd­abe...).

In this pro­ject I used JGit to ex­tract in­for­ma­tion from a repos­itory (aka. "repo"), but as I was us­ing Clo­jure I was happy to find a JGit wrap­per clj-jgit for it. At the time of writ­ing my source code is only about 200 lines of code long, but it had to deal with some com­plex­ities such as how to cache in­ter­me­di­ate re­sults to speed up de­vel­op­ment and re-runs, in­den­ti­fy­ing nu­ances such as JGit throw­ing a null pointer ex­cep­tion when blam­ing a sym­link, how to rep­re­sent files "blames" and how to ap­ply patches to them, and how to store all the re­sults to Elas­tic­search for fu­ture anal­ysis.

The first step on a repos­itory's anal­ysis is to load it from disk to mem­ory via the jgit/load-repo, which sets the con­text for all fu­ture API calls such as jgit/git-blame. The sec­ond step is to find all com­mits since a given date (as we might want to fo­cus on last X years of the pro­ject), and find all "root" com­mits. These are com­mits which have no par­ents, or all of their par­ents were cre­ated be­fore the cho­sen cut­off date. From these root com­mits we can find all com­mits which fol­low, and par­ti­tion them into a lin­ear se­quences (each com­mit has ex­actly one par­ent) what I de­cided to call "hash-chains". I de­cided to ig­nore merge com­mits on this anal­ysis, al­though they are crit­ical points in branches lives. Merges do not cre­ate new code per se, un­less there is a merge con­flict.

At this stage we are closer to run­ning the ac­tual anal­ysis. Cal­cu­lat­ing git blame on a large repo turns out to be a quite ex­pen­sive op­er­ation, be­cause it needs to de­ter­mine for all lines in all files that from which com­mit it orig­inates. Luck­ily it is suf­fi­cient to cal­cu­late it only for root com­mits (in­stead of all of them) and only for files which have changes within the com­mits we are in­ter­ested of. In ad­di­tion it is triv­ial to par­al­lelize. Cal­cu­lat­ing patches be­tween two com­mits is or­ders of mag­ni­tude faster, so this anal­ysis can be run rea­son­ably fast by us­ing "root blames" as a start­ing point and us­ing in­cre­men­tal patches to de­ter­mine how the "blame state" looks like at sub­se­quent com­mits. Putting all this to­gether (and some post-pro­cess­ing) re­sults in a se­quence of hash-maps with a very rich set of in­for­ma­tion, an ex­am­ple is shown at Fig­ure 3. It has the chunk's old and new com­mits' meta­data (date, folder, file name + type, au­thor's name and email and com­mit's hash + mes­sage) and de­rived in­for­ma­tion in­clud­ing the age of the code and if the new au­thor is the same as the old one. The "id" of the data item is gen­er­ated in a de­ter­min­is­tic man­ner from com­mit hashes and the file path so that re-in­dex­ing all data to Elas­tic­search will not leave old data be­hind, but will in­stead over­write it.

document
Figure 3: An ex­am­ple hash-map, show­ing data on rows on a file src/jvm/clo­jure/lang/Com­piler.java from the com­mit d3ab... be­ing re­placed by newer code from the com­mit 9b80a..., end­ing its 1969-days long con­tri­bu­tion to Clo­jure.
reauth_1
Figure 4: The age dis­tri­bu­tion of re­placed code, with a color in­di­ca­tion on "re-au­thor­ship".

Nat­urally one can sim­ply keep all data in-mem­ory and run any anal­ysis on it, but Elas­tic­search + Kibana has been proven to be a very scal­able and rel­atively easy-to-use so­lu­tion for dash­boards and ad-hoc anal­ysis and search. Thus I de­cided to use it on this pro­ject as well. I can think of dozens of anal­yses to test on this, but at this time I'll just de­scribe a very sim­ple one. We all might com­mit buggy code and push it ev­ery now and then, and when this hap­pens we rush to fix the mis­take. Also typ­ically when a group of soft­ware de­vel­op­ers are build­ing a pro­ject, pro­gram­mers tend to work on sep­arate parts of the code base. By us­ing se­man­tic anal­ysis on the com­mit mes­sage it should be pos­si­ble to dis­tin­guish these two cases, but I haven't tried it here.

In­stead I just used Kibana's area charts to study the cor­re­la­tion be­tween "re-au­thor­ship" (when the au­thor of the new code is the same per­son who wrote the older ver­sion) and the age of the code. As the age of the code can vary from min­utes to years, I de­cided to run the vi­su­al­iza­tions in a log2 scale (for example 0 corresponds to 20 = 1 days and 10 is 210 = 1024 days or about 2.8 years). A sample area chart is shown in Figure 4. Any analysis based on plain raw histograms is a bit difficult because different repos have wildly different numbers of commits, so in Figures 5 and 6 the y-axis shows the actual percentage of re-authorship. There is a clear pattern that when a piece of code gets replaced, the younger it is the more likely it is that the new author is the same as the old one. I wonder what kinds of more sophisticated analysis is possible, for example doing more analysis on commit messages or trying to forecast buggy areas of the code base.

reauth_2
Figure 5: The age dis­tri­bu­tion of re­placed code, with a color in­di­ca­tion on "re-au­thor­ship" on repos­ito­ries of Apache Spark, An­gu­lar.js and the git itself.
reauth_3
Figure 6: The age dis­tri­bu­tion of re­placed code, with a color in­di­ca­tion on "re-au­thor­ship" on repos­ito­ries of Elas­tic­search, Node.js and Clo­jure.

Related blog posts:

BenchmarkTaxiridesEsSql
CljTaxirides
HierarchicalSchemaEs
FuzzySearchEs
ServerMon