You can now create docx using python-docx in previous post. This time, as an application of this, I tried to get the result of diff. I will also expose it with the meaning of a memorandum. I hope it will be helpful for those who think about managing the difference information of files.

Operating environment

It is a system that is premised on taking a diff on Cygwin (laugh), so I checked it in the following environment.

Cygwin (32bit) / on Windows10
python2.7
python_docx-0.8.6-py2.7

For installation of python-docx, etc., see [python_docx article here](http://qiita.com/GDaigo/items/d5b46fc43c6250dd61b1#python-docx%E3%81%AE%E3%82%A4%E3%83% B3% E3% 82% B9% E3% 83% 88% E3% 83% BC% E3% 83% AB) may also be helpful.

What to do with diff

The usage of diff, which is the target analysis this time, is limited to the following.

Use the -u command to take the difference
The message is in English

Basically, it is assumed to be used in "diff -r -u3 <target 1> <target 2>". Just change the number after u, or compare the files alone instead of -r and it should probably work. Also, if the message is in Japanese, it must be temporarily in English as "export LANG = en_US".

When I actually do it, I get this difference (this is a part of the openssl code)

diff -r -u3 async_old/arch/async_win.c async_new/arch/async_win.c
--- async_old/arch/async_win.c	2017-07-07 08:19:02.000000000 +0900
+++ async_new/arch/async_win.c	2017-07-09 22:58:36.556937300 +0900
@@ -47,7 +47,12 @@
     return 1;
 }
 
-VOID CALLBACK async_start_func_win(PVOID unused)
+VOID CALLBACK async_start_func_win2(PVOID unused)
+{
+    async_start_func();
+}
+
+VOID CALLBACK async_start_func_win3(PVOID unused)
 {
     async_start_func();
 }
 
 Only in async_new/: tst.c

Based on the difference information in this format, the following simple analysis will be performed this time.

Make a csv file by listing the files with differences and the number of lines.
List the files that are only available in either one and make them the same csv file.
Outputs the difference information of each file to dox. Color the differences (optional)

Constitution

This time, I prepared and implemented the following three python files.

Operate docx using python-docx [SimpleDocxService class (code is here)](http://qiita.com/GDaigo/items/d5b46fc43c6250dd61b1#%E3%82%B5%E3%83%B3 % E3% 83% 97% E3% 83% AB% E3% 82% B3% E3% 83% BC% E3% 83% 89)
ParseDiff class that parses diff files (use the SimpleDocxService class above)
Diff analysis app using ParseDiff class

Of these, the operation of python-docx has already been published as an article, so I will introduce the remaining two here.

Actual code

ParseDiff class that parses diff files

First is the code for the de facto main ParseDiff class, which parses the diff file.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from docx_simple_service import SimpleDocxService

class ParseDiff:

    def __init__(self, src_codename, diff_name, cvs_name):
        self.FILE_LIST_PATH_INDEX = 0
        self.FILE_LIST_COUNT_INDEX = 1
        self.src_codename = src_codename
        self.input_diffname = diff_name
        self.cvs_name = cvs_name
        self.file_list = []
        self.only_list = []
        self.latect_diff_cnt = 0
        self.docx = None
        self.output_docxname = None
        self.print_message = True #You can make adjustments here to not send a message.

    def set_docx_param(self, docx_name, font_name, font_size, title, title_img):
        self.output_docxname = docx_name
        self.docx = SimpleDocxService()
        self.docx.set_normal_font(font_name, font_size)
        self.docx.add_head(title, 0)
        if title_img != None:
            self.docx.add_picture(title_img, 3.0)

    def adjust_return_code(self, text):
        #If you add the data of the text file as it is, a line break will occur.
        #Remove it as it will be a hassle
        text = text.replace("\n", "")
        text = text.replace("\r", "")
        return text

    def adjust_filetext(self, text):
        #If you want to put it in word, you need to make it unicode, so that process.
        #For csv only, the encoding doesn't really matter, so leave it as it is.
        if self.output_docxname != None:
            text = self.docx.get_unicode_text(text, self.src_codename)
        text = self.adjust_return_code(text)
        return text

    def mark_diff_count(self):
        #Set the count of the number of difference lines as the data of the difference information list
        #The number of difference lines is counted sequentially.
        #When processing moves to the next file or when all processing is completed
        #Call here to determine the number of diff lines.
        index = len(self.file_list) - 1
        if index >= 0:
            self.file_list[index][self.FILE_LIST_COUNT_INDEX] = self.latect_diff_cnt
        self.latect_diff_cnt = 0

    def check_word(self, text, word):
        #Whether there is a word string from the beginning of text
        if text.find(word) == 0:
            return True
        else:
            return False

    def diff_command(self, text):
        #Examine the text to see if the text of the diff command is at the beginning.
        #Whether the return value was the text of the diff command
        #The text of the diff command is passed through without any special processing.
        return self.check_word(text, "diff -r")

    def only_message(self, text):
        #Examine the text to see if Only is at the beginning.
        #Whether the return value was processed as Only.

        #The only message is
        # Only in PATH: FILENAME
        #From the above message PATH/Make a FILENAME only_Add to list
        ONLY_IN = "Only in "
        PATH_END = ": "
        if self.check_word(text, ONLY_IN) == False:
            return
        #extract path string
        start = len(ONLY_IN)
        end = text.find(PATH_END, start+1)
        if end < 0:
            return #Usually don't come here
        path = text[start:end]
        #Ask for a file name
        start = end + 1
        filename = text[start:]
        filename = filename.replace("\n", "") #Remove line breaks

        #only add to list
        self.only_list.append(path + " " + filename)
        return True

    def filename_minus(self, text):
        #Examine the text---Check if is at the beginning.
        #The return value is---Whether or not the processing was done.

        # ---Format example
        # --- async_old/async_err.c￥ｔ017-07-07 08:19:02.000000000 +0900

        #In the first place---whether
        MINUS_TOP_MESSAGE = "--- "
        start = text.find(MINUS_TOP_MESSAGE)
        if start != 0:
            return False

        #Get the last position of the pathname (see format above)
        end = text.find("\t")
        if end < 0:
            return False

        #Less than,---What to do if path is found.
        #This is the beginning of processing for each file.

        #The number of diff lines in the previous file is fixed here, so update it.
        self.mark_diff_count()

        #Added the difference file list and described the file name information.
        name = text[len(MINUS_TOP_MESSAGE):end]
        list = [name, 0]
        self.file_list.append(list)
        if self.print_message:
            print "..." + name

        #If docx is not specified, no processing is performed.
        if self.output_docxname == None:
            return True

        #Write that information to docx. Text is colored
        self.docx.add_head(u"――――――――――――――――――――――――――――――――――", 1)
        self.docx.open_text();
        text = self.adjust_filetext(text)
        self.docx.add_text_color(text, 0,0,255)
        self.docx.close_text();

        return True

    def filename_plus(self, text):
        #Examine the text+++Check if is at the beginning.
        #The return value is+++Whether or not the processing was done.
        if self.check_word(text, "+++ ") == False:
            return False

        #If docx is not specified, no processing is performed.
        if self.output_docxname == None:
            return True

        #Write to docx in color.
        self.docx.open_text();
        text = self.adjust_filetext(text)
        self.docx.add_text_color(text, 255,0,0)
        self.docx.close_text();
        return True

    def do_diff_text(self, text):
        #The difference information is processed here.

        #Encoding processing if necessary, through if there is no actual situation
        text = self.adjust_filetext(text)
        if len(text) == 0:
            return

        #If there is a difference, color code and count
        red = False
        blue = False
        if text[0] == "+":
            self.latect_diff_cnt += 1
            red = True
        elif text[0] == "-":
            blue = True
            self.latect_diff_cnt += 1

        #If docx is not specified, it is only counting, so this is the end
        if self.output_docxname == None:
            return

        #Add text if docx is specified
        self.docx.open_text();
        if red:
            self.docx.add_text_color(text, 255,0,0)
        elif blue:
            self.docx.add_text_color(text, 0,0,255)
        else:
            self.docx.add_text(text)
        self.docx.close_text();

    def parse_line(self, text):
        #Analyze line by line.
        if self.diff_command(text):
            return #Since the description of the diff command is not subject to recording, it is through

        if self.only_message(text):
            return #only Message processing

        if self.filename_minus(text):
            # "--- path1"Processing related to the description of
            return

        if self.filename_plus(text):
            # "+++ path1"Processing related to the description of
            return

        #Other than the above, write as difference information.
        self.do_diff_text(text)

    def make_cvs(self):
        #Set the difference file information to csv.

        #Writing difference information
        cvs_fp = open(self.cvs_name, "w")
        cvs_fp.write(u"diff path, lines, \r\n")
        for file_obj in self.file_list:
            if self.print_message:
                print "flle:" , file_obj
            cvs_text =  file_obj[self.FILE_LIST_PATH_INDEX] + "," + \
                        str(file_obj[self.FILE_LIST_COUNT_INDEX]) + ",\r\n"
            cvs_fp.write(cvs_text)

        #Only information, sort and then write first.
        self.only_list.sort()
        cvs_fp.write(u"only path,\r\n")
        for only in self.only_list:
            if self.print_message:
                print "only:" , only
            cvs_fp.write(only + ",\r\n")
        cvs_fp.close();

    def parse(self):
        #Main diff analysis

        #Read line by line from the file and analyze
        diff_fp = open(self.input_diffname, "r")
        while True:
            line = diff_fp.readline()
            if len(line) <= 0:
                break;
            self.parse_line(line)
        #The difference information of the last file will be confirmed here, so update it.
        self.mark_diff_count()
        diff_fp.close()

        #Save if docx output is specified
        if self.output_docxname != None:
            self.docx.save(self.output_docxname)

        #Create a CSV.
        self.make_cvs()

Sorry for the dirty code as usual. I will give a brief explanation for those who are strange to see.

First, the basic parameters are set with init and set_docx_param. Below is a brief specification of the members of the ParseDiff class.

member	Contents
src_codename	Character code("shift-jis"And)
input_diffname	The path of the text file containing the diffed result
cvs_name	CSV path to output
file_list	1 data for difference information, [path] and [number of difference lines]. List of this
only_list	A list of file paths that are only in one or the other
latect_diff_cnt	Difference count of the file currently being processed
output_docxname	Output docx path, if None, docx is not created
docx	SimpleDocxService class

The application side is supposed to write such code.

ParseDiff class generation
Call set_docx_param if dox is also issued
Parse with parse (ParseDiff handles the rest)

So, I think that you can understand the flow by looking at it from the parse function (I think that the code is too dirty and it is difficult, but Takisweat)

application

It's relatively easy because the application side just calls parse.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#
# diff --strip-trailing-cr -r -Analyze based on the result of u3 path1 path2.
#

import sys
#from docx_simple_service import SimpleDoxService
from parse_diff import ParseDiff

if __name__ == "__main__":

    if len(sys.argv) < 3:
        print "You need docx. ->    parse_diff.py diff_name csv_name docx_name"
        print "You need csv only -> parse_diff.py diff_name csv_name"
        sys.exit(1)

    docx_name = None
    if len(sys.argv) > 3:
        docx_name = sys.argv[3]

    diff = ParseDiff("shift-jis", sys.argv[1], sys.argv[2])
    image = "report_top.png "
    diff.set_docx_param(
                docx_name,        #file name
                "Courier New",      #Font name
                8,                  #font size
                u"Difference information",         #title
                image               #Opening picture
            )
    diff.parse()

    print "complete."

The argument is set as follows.

For CSV only, diff path and CSV path
If you also issue docx, then the path of docx

The character code is fixed to shift-jis this time (I'm sorry because the use case in me was mostly the source code used in Windows). The rear image file is also fixed. If you want to change this part dynamically, I think there is a solution such as adding it to the argument or creating a separate configuration file.

Actually use

Assumptions, etc.

The recommended conditions for executing the above code are as follows.

Running on cygwin
3 pythons are in the same folder
The file name of the SimpleDocxService class must be docx_simple_service.py
The file name of the ParseDiff class must be parse_diff.py
The file name of the application must be make_diff_report.py
Place the image file report_top.png in the same folder as the above python file

This time, I made this recommended condition because of my circumstances of taking a diff and analyzing the source code on Windows with Cygwin. Of course, you can change the file name by changing the import description. You also need to replace the part that says make_diff_report in the following explanation.

I also need an image file. In this article, I used the following images of professional students as well as previous.

By the way, the material of professional student is obtained from the following, and the size and character insertion are processed. http://pronama.azurewebsites.net/pronama/

And, of course, there is also a License, so keep in mind.

How to use

First, make the difference information into an appropriate text file on Cygwin. Follow the steps below. export is unnecessary if the message is in English in the first place, and once it is done, it is unnecessary after that.


export LANG=en_US
diff -r -u3 Target folder 1 Target folder 2> diff.txt

If you do so, the difference information will be included in diff.txt, so take a quick look with an editor and see if there is a difference as described above. So, if there seems to be the desired data, I will analyze it with this python.

This time, we have prepared a method to output only CSV and a method to output docx. This is because the docx process takes time when the difference is large. If you just want statistical data, I found that it is faster to process with csv only, so I am doing this.

If you only output CSV, it looks like this.

 python make_diff_report.py diff.txt diff.csv

You will have a file called diff.csv. If you look at this in Excel, statistical data will come out like this (I am processing it on the Excel screen a little)

In this way, you will see a list of diff files and their number of lines, and a list of files that are only one of them.

Next, if you want to output docx as well, do as follows.

 python make_diff_report.py diff.txt diff.csv diff.docx

In addition to diff.csv, diff.docx is also generated. When opened in Word, it looks like this.

I was able to analyze it like that. By the way, in my case, I usually output statistics only in CSV, and some files are output in docx, and I use it to process from Word later.

license

I used it below. Thank you for providing the wonderful software.

(I will write it for the time being ...) The above code is in the public domain. It's not enough code to claim copyright. However, of course, no one will undertake any damage when using it. Just be careful there.
Python itself is a PSF (Python Software Foundation) license.
For information on ↑, see [Python on Wikipedia](https://ja.wikipedia.org/wiki/Python#.E3.83.A9.E3.82.A4.E3.82.BB.E3.83.B3. E3.82.B9) is the source.
The license for python-docx is listed below. Sounds like MIT. https://github.com/python-openxml/python-docx/blob/master/LICENSE
Please note that the images of professional students must be used in accordance with "Professional students usage guidelines".

that's all.

[Python] [Word] [python-docx] Simple analysis of diff data using python