Aggregate Git logs using GitPython and analyze associations using Orange

I am honored to be in charge of Christmas Eve for the Lux Advent Calendar 2016. I am dutifully posting on the Eve itself (while watching Music Station).


Introduction

The git log command is a handy version-control tool that tells you when, by whom, and what was committed. However, if you want to use it for quality control and analysis, a little extra work is needed. In an earlier post I summarized how to format git log with a one-liner (see References), but there is a limit to what you can do in a single line. In actual work there is more to do after shaping the logs, such as collecting them every morning to check development status, or planning tests and evaluating quality at the end of a project, so this time I thought about how to use the logs more practically.

Aggregation using Python

So, this time I summarized how to aggregate and analyze Git logs using GitPython. Git can of course be handled from other languages, but Python is good at aggregation and analysis, and as a scripting language it makes it easy to rewrite the code to match the quality checks a project needs, so I found it easy to work with. That said, I have only ever written throwaway-level Python for small aggregation tasks like this one, so please forgive any clumsy parts of the code, and feel free to comment if you notice anything.

The sample code in this article has been confirmed to work with Python 2.7.

GitPython

To work with Git in Python, there is a handy library called GitPython. Please refer to the official documentation for the installation procedure.
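For reference, when installing with pip it is typically just:

$ pip install GitPython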

Get commit information

Open the local repository by specifying its path with Repo('/path'). You can then get commit information for a particular branch with Repo.iter_commits().

from git import Repo
import datetime

repo = Repo('./')
# Walk the latest 10 commits on the master branch of the repository in the current directory
for item in repo.iter_commits('master', max_count=10):
    dt = datetime.datetime.fromtimestamp(item.authored_date).strftime("%Y-%m-%d %H:%M:%S")
    print("%s %s %s " % (item.hexsha, item.author, dt))

For the commit information that can be obtained, refer to Objects.Commit API Reference.

The above example outputs the hash value, the commit author, and the commit date and time of the latest 10 commits on the master branch of the repository in the current directory.

Output example


ddffe26850e8175eb605f975be597afc3fca8a03 Sebastian Thiel 2016-12-22 20:51:02 
3d6e1731b6324eba5abc029b26586f966db9fa4f Sebastian Thiel 2016-12-22 20:48:59 
82ae723c8c283970f75c0f4ce097ad4c9734b233 Sebastian Thiel 2016-12-22 20:44:14 
15b6bbac7bce15f6f7d72618f51877455f3e0ee5 Sebastian Thiel 2016-12-22 20:35:30 
c823d482d03caa8238b48714af4dec6d9e476520 Sebastian Thiel 2016-12-09 00:34:04 
b0c187229cea1eb3f395e7e71f636b97982205ed Sebastian Thiel 2016-12-09 00:07:11 
f21630bcf83c363916d858dd7b6cb1edc75e2d3b Sebastian Thiel 2016-12-09 00:01:35 
06914415434cf002f712a81712024fd90cea2862 Sebastian Thiel 2016-12-08 22:32:58 
2f207e0e15ad243dd24eafce8b60ed2c77d6e725 Sebastian Thiel 2016-12-08 21:20:52 
a8437c014b0a9872168b01790f5423e8e9255840 Vincent Driessen 2016-12-08 21:14:27 

By the way, the above output example is the commit log of GitPython as of December 23, 2016.

Specifying Commit Limiting

The Commit Limiting options of git log introduced in the earlier post can be specified as arguments to iter_commits (in the above example, max_count=10 is specified). You can specify more than one, separated by commas. Also, replace the - part of an option name with _ when specifying it. By the way, when I specified no_merges, a syntax error occurred; but as described in the git-rev-list documentation (https://git-scm.com/docs/git-rev-list#git-rev-list---no-merges), --no-merges is the same as --max-parents=1 as a git log option, so I was able to use max_parents=1 instead.
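As a rough sketch, combining several limiting options as keyword arguments might look like the following (the option values here are just illustrative):

from git import Repo

repo = Repo('./')
# Each keyword argument becomes a git rev-list option:
# max_count -> --max-count, max_parents -> --max-parents, since -> --since
for item in repo.iter_commits('master', max_count=10, max_parents=1, since='2016-01-01'):
    print(item.hexsha)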

Specify revision range

You can also use the double-dot revision range syntax introduced in the earlier post. You simply replace the master part in the example above with something like master..experiment.
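For example, to look only at commits that are on experiment but not yet on master (the branch name experiment is just for illustration, as above):

from git import Repo

repo = Repo('./')
# Commits reachable from experiment but not from master
for item in repo.iter_commits('master..experiment'):
    print(item.hexsha)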

Application to log aggregation and analysis

Now let's consider how to apply this to log aggregation and analysis. Each commit object obtained by the code above also carries information about all the files changed in that commit; you can get the per-file statistics with stats.files.

Information details for each commit file

This is an example that prints to standard output, in CSV format, the number of added and deleted lines for each committed file, together with information such as the date and time it was committed.

from git import Repo
import datetime

repo = Repo('./')
print('hexsha,author,authored_date,file_name,insertions,deletions,lines')
for item in repo.iter_commits('master', max_count=10):
    dt = datetime.datetime.fromtimestamp(item.authored_date).strftime("%Y-%m-%d %H:%M:%S")
    # stats.files maps each changed file to its insertions/deletions/lines counts
    file_list = item.stats.files
    for file_name in file_list:
        insertions = file_list.get(file_name).get('insertions')
        deletions = file_list.get(file_name).get('deletions')
        lines = file_list.get(file_name).get('lines')
        print("%s,%s,%s,%s,%s,%s,%s" % (item.hexsha, item.author, dt, file_name, insertions, deletions, lines))

It would be quite difficult to get this information with commands alone and put it on one line, but with GitPython you just loop over the commits and pick it up.

Number of commits per file

Now let's count the number of changes during a period per file instead of per commit. The following outputs, for each file, the number of commits within the last 6 months.

from git import Repo
import datetime

repo = Repo('./')
print('file_name,commit_count')
file_list = {}
for item in repo.iter_commits('master', since='6 months ago'):
    for file_name in item.stats.files:
        if file_name not in file_list:
            file_list[file_name] = []
        # Record who committed the file and when
        author = {}
        author[item.author] = datetime.datetime.fromtimestamp(item.authored_date).strftime("%Y-%m-%d %H:%M:%S")
        file_list[file_name].append(author)

for file_name in file_list:
    print("%s,%d" % (file_name, len(file_list[file_name])))

Furthermore, by extending this, it should also be possible to aggregate the number of changed lines per file over the project period, or to calculate the commit interval (the number of days since the last change). You can aggregate not only by file but also by day, month, committer, and so on.
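As a rough sketch of that idea (the per-file keys come from stats.files as used above; the aggregation itself is just one possible approach), counting the total changed lines per file and the days since each file's last change might look like this:

from git import Repo
import datetime

repo = Repo('./')
changed_lines = {}
last_changed = {}
for item in repo.iter_commits('master', since='6 months ago'):
    dt = datetime.datetime.fromtimestamp(item.authored_date)
    for file_name, stats in item.stats.files.items():
        changed_lines[file_name] = changed_lines.get(file_name, 0) + stats['lines']
        # iter_commits walks from newest to oldest, so the first time we see
        # a file is its most recent change
        if file_name not in last_changed:
            last_changed[file_name] = dt

print('file_name,changed_lines,days_since_last_change')
for file_name in changed_lines:
    days = (datetime.datetime.now() - last_changed[file_name]).days
    print("%s,%d,%d" % (file_name, changed_lines[file_name], days))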

Utilization of Git logs

If Python can be used for the aggregation, the logs can be put to new uses, such as feeding them into other Python libraries for analysis or integrating with external tools and services.

Association analysis

So, let's think about association analysis of the commit information. What I want to get is information like "people who changed this file also changed that one". In Python, you can perform association analysis with a library called Orange.

How to use Orange

According to the official documentation on association rules and frequent itemsets, if you save the lists to be analyzed in CSV format with the .basket extension, Orange can analyze the associations. Therefore, we will output the files committed together in CSV format, one commit per line.
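For reference, a .basket file is simply one comma-separated transaction per line, so the output we are aiming for would look roughly like this (the file names are just illustrative):

git/cmd.py,git/test/test_git.py
git/diff.py,git/test/test_diff.py,doc/source/changes.rst
git/config.py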

Association analysis of files committed in the last year

First of all, applying the method above, we use GitPython to output the files committed together, separated by commas.

commit-file-list.py


from git import Repo

repo = Repo('./')
# One line per commit: the names of the files changed together, comma-separated
for item in repo.iter_commits('master', since='1 years ago'):
    print(",".join(item.stats.files.keys()))

Redirect the output to create the .basket file:

$ python commit-file-list.py > commit-file-list.basket

Let's analyze the associations in the commit-file-list.basket created above, referring to Orange's sample code.

import Orange
data = Orange.data.Table('commit-file-list.basket')
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.02, confidence=0.5)
print "%4s %4s %4s" % ("Supp", "Conf", "Rule")
for r in rules:
    if 'git/config.py' in r.name:
        print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)

Output example


Supp Conf Rule
 0.0  0.6  git/test/test_git.py -> git/cmd.py
 0.0  0.2  git/cmd.py -> git/test/test_git.py
 0.1  0.7  git/test/test_diff.py -> git/diff.py
 0.1  0.7  git/diff.py -> git/test/test_diff.py
 0.0  0.4  git/test/test_diff.py -> git/diff.py doc/source/changes.rst

Support is the proportion of all transactions in which the whole rule appears, so it is not very important in this case; set the threshold as low as you can get away with. Confidence is the proportion of transactions that contain the whole rule among those that satisfy the rule's precondition (the ratio of patterns containing both A and B to all patterns containing A), so in this case it is exactly the value we care about, and I want to focus on rules where it is high.
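To make the two measures concrete, here is a toy calculation with made-up numbers, not taken from the repository:

# Hypothetical counts for a rule A -> B ("commits that changed A also changed B")
total_commits = 100       # all transactions (commits)
commits_with_a = 10       # commits that changed A
commits_with_a_and_b = 5  # commits that changed both A and B

support = float(commits_with_a_and_b) / total_commits     # 0.05: how often the whole rule appears
confidence = float(commits_with_a_and_b) / commits_with_a # 0.5: given A changed, how often B changed too
print(support, confidence)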

Searching the association rules for "this file is usually changed too" before merging into master

First, output the files to be merged in CSV format, one line per commit, and create a .basket file. Since the target is what will be merged into master, we get the commits using the double-dot syntax and output the files associated with them.

merge-target-list.py


from git import Repo

repo = Repo('./')
# Commits to be merged into master (merge commits themselves excluded by max_parents=1)
for item in repo.iter_commits('master..experiment', max_parents=1):
    print(",".join(item.stats.files.keys()))

Redirect the output to create the .basket file:

$ python merge-target-list.py > merge-target-list.basket

The approach is to extract rules from the one-year Git log as described above, find the rules whose left-hand side matches the files in merge-target-list.basket, and output the right-hand side as files that should probably be committed as well (for example, if the rule A, B -> C is extracted from the one-year log and A and B exist in the merge target, C is output as a candidate).

import Orange

# Learn association rules from one year of commit history
data = Orange.data.Table('commit-file-list.basket')
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.02, confidence=0.5)
for r in rules:
    if 'git/config.py' in r.name:
        print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)

# Check each merge-target transaction against the learned rules
merge_data = Orange.data.Table('merge-target-list.basket')
for d in merge_data:
    for rule in rules:
        #print rule
        if rule.applies_left(d):
            print (u"People who committed %s also commit, at a rate of %3.1f%%, %s" % (rule.left.get_metas(str).keys(), (rule.confidence * 100), rule.right.get_metas(str).keys()))

Output example


People who committed ['git/test/test_remote.py'] also commit, at a rate of 55.0%, ['git/test/lib/helper.py']
People who committed ['git/test/test_remote.py'] also commit, at a rate of 55.0%, ['git/test/test_base.py']
People who committed ['git/test/test_remote.py'] also commit, at a rate of 50.0%, ['git/util.py']
People who committed ['git/test/test_base.py'] also commit, at a rate of 55.0%, ['git/test/test_git.py']
...(abridged)

More candidates were output than I expected for a confidence of 0.5. I think it is best to tune this according to the situation and characteristics of the project.

In conclusion

Since Python has many useful libraries, it seems the logs can be applied not only to association analysis but also to various other uses. Actually, I was also thinking of drawing graphs with matplotlib and making a ChatBot, but the association analysis turned out longer than I expected, so I would like to continue those as my winter vacation homework.

References

Run Apriori from Python with Orange
Format Git logs with one liner


The fun Lux Advent Calendar 2016 finally reaches its last day tomorrow. @kawanamiyuu will wrap it up, so please look forward to it. Have a nice Christmas, everyone.
