I have the honor of covering Christmas Eve in the Lux Advent Calendar 2016, and I am dutifully posting on the Eve itself (while watching Music Station).
The git log command is a handy version-control tool that tells you who committed what, when, and how. However, if you want to use it for quality control and analysis, you need to be a little inventive. In the introduction, I tried to summarize how to format git log output with one-liners, but there is a limit to what you can do in one line. In actual work, shaping the logs is only the beginning: you collect the logs every morning to check the development status, or use them to plan tests and evaluate quality at the end of a project. So this time I thought about how to put the logs to practical use.
So, this time I have summarized how to aggregate and analyze Git logs using GitPython. There are ways to handle Git from other languages too, but I felt that Python was the easiest to work with: it is a scripting language that is good at aggregation and analysis, and its code is easy to rewrite to suit how a given project wants to check quality. Incidentally, I have only ever written throwaway-level Python, such as small aggregations like this one, so please forgive any clumsy parts of the code, and do leave a comment if you spot any.
The sample code summarized here has been confirmed to work with Python 2.7.
GitPython
To work with Git from Python, there is a handy library called GitPython. Please refer to the official documentation for installation instructions.
You get a repository object by specifying the local repository with Repo('/path'), and you can then get the commit information for a particular branch with Repo.iter_commits.
from git import Repo
import datetime

# Open the repository in the current directory
repo = Repo('./')
# Walk the latest 10 commits on master
for item in repo.iter_commits('master', max_count=10):
    dt = datetime.datetime.fromtimestamp(item.authored_date).strftime("%Y-%m-%d %H:%M:%S")
    print("%s %s %s " % (item.hexsha, item.author, dt))
For the commit information that can be obtained, refer to the Objects.Commit API Reference.
The above example prints the hash, the author, and the authored date and time of the latest 10 commits on the master branch of the repository in the current directory.
Output example
ddffe26850e8175eb605f975be597afc3fca8a03 Sebastian Thiel 2016-12-22 20:51:02
3d6e1731b6324eba5abc029b26586f966db9fa4f Sebastian Thiel 2016-12-22 20:48:59
82ae723c8c283970f75c0f4ce097ad4c9734b233 Sebastian Thiel 2016-12-22 20:44:14
15b6bbac7bce15f6f7d72618f51877455f3e0ee5 Sebastian Thiel 2016-12-22 20:35:30
c823d482d03caa8238b48714af4dec6d9e476520 Sebastian Thiel 2016-12-09 00:34:04
b0c187229cea1eb3f395e7e71f636b97982205ed Sebastian Thiel 2016-12-09 00:07:11
f21630bcf83c363916d858dd7b6cb1edc75e2d3b Sebastian Thiel 2016-12-09 00:01:35
06914415434cf002f712a81712024fd90cea2862 Sebastian Thiel 2016-12-08 22:32:58
2f207e0e15ad243dd24eafce8b60ed2c77d6e725 Sebastian Thiel 2016-12-08 21:20:52
a8437c014b0a9872168b01790f5423e8e9255840 Vincent Driessen 2016-12-08 21:14:27
By the way, the above output example is the commit log of GitPython as of December 23, 2016.
The Commit Limiting options of git log covered in the introduction can be passed as arguments to iter_commits (in the above example, max_count=10 is specified). You can specify more than one by separating them with commas. Also, replace each - in the option name with _. By the way, if you specify no_merge by itself, a syntax error occurs. As described [here](https://git-scm.com/docs/git-rev-list#git-rev-list---no-merges), that git log option is the same as max-parents=1, so I was able to use max_parents=1 instead.
You can also use the double-dot syntax covered in the introduction: simply replace the master argument above with master..experiment.
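As a rough illustration of the naming convention described above (and not GitPython's actual implementation), the following sketch shows which git log command line a given set of keyword arguments is meant to correspond to. The helper names git_log_option and describe_call are made up for this example.

```python
def git_log_option(name, value):
    """Return the git log option a keyword argument stands for.

    Hypothetical helper: it only demonstrates the rule "replace _ with -".
    """
    return "--%s=%s" % (name.replace("_", "-"), value)


def describe_call(rev, **kwargs):
    """Describe iter_commits(rev, **kwargs) as a git log command line."""
    # Sort the options so the description does not depend on dict order.
    options = [git_log_option(k, v) for k, v in sorted(kwargs.items())]
    return " ".join(["git", "log"] + options + [rev])


# Several limiting options combined with the double-dot syntax.
print(describe_call("master..experiment", max_count=10, max_parents=1))
# prints: git log --max-count=10 --max-parents=1 master..experiment
```

With GitPython itself, the corresponding call would be repo.iter_commits('master..experiment', max_count=10, max_parents=1).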
Now let's consider how to apply this to log aggregation and analysis. Each commit object obtained by the above code also carries all the information about the files it touched. You can get the list of committed files and their statistics with stats.files.
The following example prints, in CSV format to standard output, the number of added and deleted lines for each committed file, together with information such as the date and time of the commit.
from git import Repo
import datetime

repo = Repo('./')
# The header order matches the order of the values printed below
print('hexsha,author,authored_date,file_name,insertions,deletions,lines')
for item in repo.iter_commits('master', max_count=10):
    # The date is the same for every file of a commit, so format it once
    dt = datetime.datetime.fromtimestamp(item.authored_date).strftime("%Y-%m-%d %H:%M:%S")
    file_list = item.stats.files
    for file_name in file_list:
        insertions = file_list.get(file_name).get('insertions')
        deletions = file_list.get(file_name).get('deletions')
        lines = file_list.get(file_name).get('lines')
        print("%s,%s,%s,%s,%s,%s,%s" % (item.hexsha, item.author, dt, file_name, insertions, deletions, lines))
Getting this information with commands alone and fitting it onto one line is quite difficult, but with GitPython you only need to loop over the commits and pick it up.
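One caveat with building CSV lines by string formatting is that a file name containing a comma would break the columns. The sketch below, written for Python 3, shows one way to let the standard csv module handle the quoting. The function names and the sample data are made up for illustration, and the files argument is assumed to have the same shape as commit.stats.files. Unlike the code above, it formats the date in UTC so that the result does not depend on the local time zone.

```python
import csv
import datetime
import io


def commit_rows(hexsha, author, authored_date, files):
    """Yield one CSV row per file of a commit.

    `files` is assumed to look like commit.stats.files:
    {file_name: {'insertions': int, 'deletions': int, 'lines': int}}.
    """
    # Format in UTC so the output does not depend on the local time zone.
    dt = datetime.datetime.fromtimestamp(authored_date, datetime.timezone.utc)
    stamp = dt.strftime("%Y-%m-%d %H:%M:%S")
    for file_name in sorted(files):
        stat = files[file_name]
        yield [hexsha, author, stamp, file_name,
               stat['insertions'], stat['deletions'], stat['lines']]


def write_csv(rows, out):
    """Write a header plus rows; csv.writer quotes fields containing commas."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(['hexsha', 'author', 'authored_date', 'file_name',
                     'insertions', 'deletions', 'lines'])
    writer.writerows(rows)


# Made-up sample data standing in for one commit.
sample_files = {'docs/a,b.txt': {'insertions': 3, 'deletions': 1, 'lines': 4}}
buffer = io.StringIO()
write_csv(commit_rows('abc123', 'Jane Doe', 0, sample_files), buffer)
print(buffer.getvalue())
```

With a real repository, the same functions could be fed item.hexsha, str(item.author), item.authored_date, and item.stats.files for each commit in the loop.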
Next, let's count changes per file rather than per commit over a given period. The following prints, for each file, the number of commits made in the last 6 months.
from git import Repo
import datetime

repo = Repo('./')
print('file_name,commit_count')
# Maps each file name to the list of {author: date} entries that touched it
file_list = {}
for item in repo.iter_commits('master', since='6 months ago'):
    for file_name in item.stats.files:
        if file_name not in file_list:
            file_list[file_name] = []
        author = {}
        author[item.author] = datetime.datetime.fromtimestamp(item.authored_date).strftime("%Y-%m-%d %H:%M:%S")
        file_list[file_name].append(author)
# The number of entries per file is the number of commits that changed it
for file_name in file_list:
    print("%s,%d" % (file_name, len(file_list[file_name])))
Building on this, it should also be possible to total the number of changed lines per file over the project period, or to calculate the commit interval (the number of days since the last change). You can aggregate not only by file but also by day, by month, by committer, and so on.
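As a minimal sketch of the two aggregations just mentioned, the following totals the changed lines per file and computes the number of days since each file was last changed. It works on plain (authored_date, files) pairs so that it can be tried without a repository. The function name summarize, the record shape, and the sample data are assumptions made for this example.

```python
DAY = 86400  # seconds per day


def summarize(records, now):
    """Return {file_name: (changed_lines, days_since_last_change)}.

    `records` is an iterable of (authored_date, files) pairs, where
    `authored_date` is a Unix timestamp and `files` is assumed to look
    like commit.stats.files.
    """
    summary = {}
    for authored_date, files in records:
        for file_name, stat in files.items():
            entry = summary.setdefault(file_name, {'lines': 0, 'last': authored_date})
            entry['lines'] += stat['lines']
            # Keep the most recent change, whatever order the commits come in.
            entry['last'] = max(entry['last'], authored_date)
    return dict((name, (e['lines'], int((now - e['last']) // DAY)))
                for name, e in summary.items())


# Made-up sample: two commits, 10 and 12 days after the epoch.
records = [
    (10 * DAY, {'a.py': {'lines': 5}, 'b.py': {'lines': 2}}),
    (12 * DAY, {'a.py': {'lines': 3}}),
]
print('file_name,changed_lines,days_since_last_change')
for name, (lines, days) in sorted(summarize(records, now=15 * DAY).items()):
    print("%s,%d,%d" % (name, lines, days))
```

With GitPython, the records could be built as (item.authored_date, item.stats.files) for each commit returned by iter_commits, with time.time() passed as now.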
Once the aggregation is done in Python, the logs open up to new uses, such as feeding them into other Python libraries for analysis or linking them with external tools and services.
So, let's think about association analysis of the commit information. What I want to obtain is information of the form "people who changed this file also changed that one". In Python, you can perform association analysis with a library called Orange.
According to the official documentation, Association rules and frequent itemsets, if you save the list to be analyzed in CSV format with the .basket extension, Orange can apparently run the association analysis on it. Therefore, we will output the files committed together in CSV format, one commit per line.
First, applying the method above, we use GitPython to print the files committed at the same time, separated by commas.
commit-file-list.py
from git import Repo

repo = Repo('./')
# One line per commit: the files changed together, separated by commas
for item in repo.iter_commits('master', since='1 years ago'):
    print(",".join(item.stats.files.keys()))
$ python commit-file-list.py > commit-file-list.basket
Let's run the association analysis on the commit-file-list.basket created above, referring to the Orange sample code.
import Orange

# Load the basket file and induce association rules from it
data = Orange.data.Table('commit-file-list.basket')
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.02, confidence=0.5)
print "%4s %4s %4s" % ("Supp", "Conf", "Rule")
for r in rules:
    if 'git/config.py' in r.name:
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Output example
Supp Conf Rule
0.0 0.6 git/test/test_git.py -> git/cmd.py
0.0 0.2 git/cmd.py -> git/test/test_git.py
0.1 0.7 git/test/test_diff.py -> git/diff.py
0.1 0.7 git/diff.py -> git/test/test_diff.py
0.0 0.4 git/test/test_diff.py -> git/diff.py doc/source/changes.rst
support is the proportion of all commits in which the rule appears, so it is not very important in this case; set its threshold as low as possible. confidence is the proportion of commits containing the rule's precondition that also contain the whole rule (the share of "A and B" among all the patterns that include A), which is exactly the value we are after here, so I want to give weight to rules with high confidence.
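To make the two measures concrete, here is a small plain-Python calculation on made-up baskets. It is only meant to illustrate the definitions above, not to reproduce how Orange computes them; the function names and the sample data are assumptions for this example.

```python
def support(baskets, items):
    """Share of all baskets that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in baskets if items <= set(basket))
    return hits / float(len(baskets))


def confidence(baskets, left, right):
    """Share of the baskets containing `left` that also contain `right`.

    Assumes `left` appears in at least one basket.
    """
    return support(baskets, set(left) | set(right)) / support(baskets, left)


# Four made-up commits, each listing the files changed together.
baskets = [
    ['git/diff.py', 'git/test/test_diff.py'],
    ['git/diff.py', 'git/test/test_diff.py', 'doc/source/changes.rst'],
    ['git/diff.py'],
    ['git/cmd.py'],
]
left, right = ['git/diff.py'], ['git/test/test_diff.py']
print("support    %.2f" % support(baskets, left + right))
print("confidence %.2f" % confidence(baskets, left, right))
```

In this sample, the pair appears in 2 of the 4 commits (support 0.50), while 2 of the 3 commits that touch git/diff.py also touch the test file (confidence 0.67). Reversing the rule gives a confidence of 1.0, which is why the direction of a rule matters.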
First, output the files to be merged in CSV format, one line per commit, and create a .basket file. Since the targets are the commits to be merged into master, we get them using the double-dot syntax and output the files associated with each of them.
merge-target-list.py
from git import Repo

repo = Repo('./')
# Commits on experiment that are not yet on master
for item in repo.iter_commits('master..experiment', max_parents=1):
    print(",".join(item.stats.files.keys()))
$ python merge-target-list.py > merge-target-list.basket
Next, apply the rule extraction described above to the one-year Git log, find the rules whose left side matches the merge targets in merge-target-list.basket, and output a suggestion that the right side should perhaps be committed as well (for example, if A, B -> C is extracted as a rule from the one-year log and A and B exist in the merge target, C is output as a candidate).
import Orange

# Extract rules from the one-year commit log
data = Orange.data.Table('commit-file-list.basket')
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.02, confidence=0.5)
for r in rules:
    if 'git/config.py' in r.name:
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)

# Check every commit to be merged against the left side of each rule
merge_data = Orange.data.Table('merge-target-list.basket')
for d in merge_data:
    for rule in rules:
        #print rule
        if rule.applies_left(d):
            print(u"People who committed %s also committed %s at a rate of %3.1f %%" % (rule.left.get_metas(str).keys(), rule.right.get_metas(str).keys(), (rule.confidence*100)))
Output example
People who committed ['git/test/test_remote.py'] also committed ['git/test/lib/helper.py'] at a rate of 55.0 %
People who committed ['git/test/test_remote.py'] also committed ['git/test/test_base.py'] at a rate of 55.0 %
People who committed ['git/test/test_remote.py'] also committed ['git/util.py'] at a rate of 50.0 %
People who committed ['git/test/test_base.py'] also committed ['git/test/test_git.py'] at a rate of 55.0 %
...(abridgement)
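The matching step can also be sketched without Orange, which may help when checking the logic in isolation. In the example below, a rule is a plain (left, right, confidence) tuple of file-name sets; that shape, the function name suggest_candidates, and the sample rules are assumptions made for illustration and are not Orange's rule objects. Unlike the output above, this sketch also skips files that are already part of the merge.

```python
def suggest_candidates(rules, merge_files):
    """Return (missing_files, confidence, left) for every rule that applies.

    `rules` is a list of (left, right, confidence) tuples whose left and
    right sides are sets of file names (a hypothetical, simplified shape).
    """
    merge_files = set(merge_files)
    candidates = []
    for left, right, conf in rules:
        missing = set(right) - merge_files
        # The rule applies when its whole left side is being merged;
        # only files not already in the merge are worth suggesting.
        if set(left) <= merge_files and missing:
            candidates.append((sorted(missing), conf, sorted(left)))
    return candidates


rules = [
    ({'A', 'B'}, {'C'}, 0.7),
    ({'A'}, {'B'}, 0.9),   # B is already in the merge, so nothing to suggest
    ({'D'}, {'E'}, 0.8),   # D is not being merged, so the rule does not apply
]
for missing, conf, left in suggest_candidates(rules, ['A', 'B']):
    print("%s -> consider also committing %s (confidence %.1f)" % (left, missing, conf))
# prints: ['A', 'B'] -> consider also committing ['C'] (confidence 0.7)
```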
With confidence set to 0.5, more candidates were output than I expected. I think it is best to tune this value according to the situation and characteristics of the project.
Since Python has many useful libraries, this approach looks applicable not only to association analysis but to many other uses. I had actually been thinking about drawing graphs with matplotlib and building a chatbot, but the association analysis ran longer than I expected, so I will stop here this time and leave the rest as homework for my winter vacation.
Run Apriori from Python with Orange
Format Git logs with one liner
The fun Lux Advent Calendar 2016 finally reaches its last day tomorrow. @kawanamiyuu will wrap it up, so please look forward to it. Have a nice Christmas, everyone.