[Python] [Word] [python-docx] Simple analysis of diff data using python

In my previous post I showed how to create a docx file with python-docx. This time, as an application of that, I tried processing the output of diff. I am publishing it partly as a memo to myself. I hope it helps anyone thinking about how to manage file difference information.

Operating environment

The whole setup assumes that the diff is taken on Cygwin (lol), so I checked it in the following environment.

For installing python-docx and so on, the [python-docx article here](http://qiita.com/GDaigo/items/d5b46fc43c6250dd61b1#python-docx%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB) may also be helpful.

What to do with diff

The diff usage targeted by this analysis is limited to the following.

Basically, it is assumed that diff is run as "diff -r -u3 <target 1> <target 2>". Changing the number after u, or comparing single files without -r, should probably work as well. Also, if diff prints its messages in Japanese, temporarily switch them to English with "export LANG=en_US".

When I actually run it, I get a difference like this (this is part of the OpenSSL code).

diff -r -u3 async_old/arch/async_win.c async_new/arch/async_win.c
--- async_old/arch/async_win.c	2017-07-07 08:19:02.000000000 +0900
+++ async_new/arch/async_win.c	2017-07-09 22:58:36.556937300 +0900
@@ -47,7 +47,12 @@
     return 1;
 }
 
-VOID CALLBACK async_start_func_win(PVOID unused)
+VOID CALLBACK async_start_func_win2(PVOID unused)
+{
+    async_start_func();
+}
+
+VOID CALLBACK async_start_func_win3(PVOID unused)
 {
     async_start_func();
 }
 
Only in async_new/: tst.c

Based on difference information in this format, the following simple analysis is performed.
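
As a rough illustration of how lines in this format can be told apart, here is a minimal sketch written for Python 3 (the article's own code further down is Python 2). The function name classify_line and the label strings are hypothetical names chosen for this example only. The separate "hunk" label is also an addition here: the ParseDiff class shown later treats "@@" lines as ordinary diff text.

```python
def classify_line(line):
    """Return a label for one line of `diff -r -u` output (illustrative only)."""
    if line.startswith("diff -r"):
        return "command"    # echo of the diff command, carries no data
    if line.startswith("Only in "):
        return "only"       # file that exists in one tree only
    if line.startswith("--- ") and "\t" in line:
        return "old_file"   # header of the old file: path, TAB, timestamp
    if line.startswith("+++ "):
        return "new_file"   # header of the new file
    if line.startswith("@@"):
        return "hunk"       # hunk header with line ranges
    if line.startswith("+"):
        return "added"
    if line.startswith("-"):
        return "removed"
    return "context"


sample = [
    "diff -r -u3 async_old/arch/async_win.c async_new/arch/async_win.c",
    "--- async_old/arch/async_win.c\t2017-07-07 08:19:02.000000000 +0900",
    "+++ async_new/arch/async_win.c\t2017-07-09 22:58:36.556937300 +0900",
    "@@ -47,7 +47,12 @@",
    " }",
    "-VOID CALLBACK async_start_func_win(PVOID unused)",
    "+VOID CALLBACK async_start_func_win2(PVOID unused)",
    "Only in async_new/: tst.c",
]
labels = [classify_line(s) for s in sample]
```

The order of the checks matters: the "--- " and "+++ " headers have to be tested before the plain "+" and "-" prefixes, otherwise they would be counted as changed lines.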

Structure

This time, I prepared and implemented the following three Python files.

Of these, the python-docx handling has already been published as an article, so I will introduce the remaining two here.

Actual code

ParseDiff class that parses diff files

First is the code for the ParseDiff class, the de facto main part, which parses the diff file.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from docx_simple_service import SimpleDocxService

class ParseDiff:

    def __init__(self, src_codename, diff_name, cvs_name):
        self.FILE_LIST_PATH_INDEX = 0
        self.FILE_LIST_COUNT_INDEX = 1
        self.src_codename = src_codename
        self.input_diffname = diff_name
        self.cvs_name = cvs_name
        self.file_list = []
        self.only_list = []
        self.latect_diff_cnt = 0
        self.docx = None
        self.output_docxname = None
        self.print_message = True  # Set to False to suppress progress messages.

    def set_docx_param(self, docx_name, font_name, font_size, title, title_img):
        self.output_docxname = docx_name
        self.docx = SimpleDocxService()
        self.docx.set_normal_font(font_name, font_size)
        self.docx.add_head(title, 0)
        if title_img is not None:
            self.docx.add_picture(title_img, 3.0)

    def adjust_return_code(self, text):
        # Lines read from the diff file keep their line ending, which would
        # add an unwanted line break in the output, so strip it here.
        text = text.replace("\n", "")
        text = text.replace("\r", "")
        return text

    def adjust_filetext(self, text):
        # Text written to Word must be unicode, so convert it here.
        # For CSV-only output the encoding does not matter, so leave it as is.
        if self.output_docxname is not None:
            text = self.docx.get_unicode_text(text, self.src_codename)
        text = self.adjust_return_code(text)
        return text

    def mark_diff_count(self):
        # Store the changed-line count in the most recent file_list entry.
        # Changed lines are counted one by one as they are read, so the total
        # for a file is only known when processing moves on to the next file
        # or when all input has been read. Call this at those points.
        index = len(self.file_list) - 1
        if index >= 0:
            self.file_list[index][self.FILE_LIST_COUNT_INDEX] = self.latect_diff_cnt
        self.latect_diff_cnt = 0

    def check_word(self, text, word):
        # Return True if text starts with word.
        return text.startswith(word)

    def diff_command(self, text):
        # Return True if the line is the echoed diff command.
        # Such lines are skipped without any special processing.
        return self.check_word(text, "diff -r")

    def only_message(self, text):
        # Return True if the line was handled as an "Only in" message.
        #
        # The message has the form
        #   Only in PATH: FILENAME
        # PATH and FILENAME are extracted, joined and added to only_list.
        ONLY_IN = "Only in "
        PATH_END = ": "
        if not self.check_word(text, ONLY_IN):
            return False
        # Extract the path string
        start = len(ONLY_IN)
        end = text.find(PATH_END, start+1)
        if end < 0:
            return False  # Normally not reached
        path = text[start:end]
        # Extract the file name, skipping the whole ": " separator
        start = end + len(PATH_END)
        filename = self.adjust_return_code(text[start:])  # Remove line breaks

        # Add to only_list
        self.only_list.append(path + " " + filename)
        return True

    def filename_minus(self, text):
        # Return True if the line starts with "--- " and was handled as the
        # header of the old file.
        #
        # Format example:
        #   --- async_old/async_err.c\t2017-07-07 08:19:02.000000000 +0900

        # Does the line start with "--- " at all?
        MINUS_TOP_MESSAGE = "--- "
        start = text.find(MINUS_TOP_MESSAGE)
        if start != 0:
            return False

        # The path name ends at the tab (see the format above)
        end = text.find("\t")
        if end < 0:
            return False

        # From here on, a "--- path" header was found.
        # This is where processing of each file begins.

        # The changed-line count of the previous file is now final, so store it.
        self.mark_diff_count()

        # Add an entry with the file name to the list of changed files.
        name = text[len(MINUS_TOP_MESSAGE):end]
        entry = [name, 0]
        self.file_list.append(entry)
        if self.print_message:
            print "..." + name

        #If docx is not specified, no processing is performed.
        if self.output_docxname == None:
            return True

        #Write that information to docx. Text is colored
        self.docx.add_head(u"――――――――――――――――――――――――――――――――――", 1)
        self.docx.open_text()
        text = self.adjust_filetext(text)
        self.docx.add_text_color(text, 0,0,255)
        self.docx.close_text()

        return True

    def filename_plus(self, text):
        # Return True if the line starts with "+++ " and was handled as the
        # header of the new file.
        if not self.check_word(text, "+++ "):
            return False

        # If docx is not specified, no further processing is needed.
        if self.output_docxname is None:
            return True

        # Write the line to docx in color.
        self.docx.open_text()
        text = self.adjust_filetext(text)
        self.docx.add_text_color(text, 255,0,0)
        self.docx.close_text()
        return True

    def do_diff_text(self, text):
        # The body of the diff is processed here.

        # Convert the encoding if needed; skip lines that end up empty.
        text = self.adjust_filetext(text)
        if len(text) == 0:
            return

        # Added and removed lines are counted and marked for coloring.
        red = False
        blue = False
        if text[0] == "+":
            self.latect_diff_cnt += 1
            red = True
        elif text[0] == "-":
            blue = True
            self.latect_diff_cnt += 1

        # Without docx output only the count is needed, so stop here.
        if self.output_docxname is None:
            return

        # Add the text to docx: added lines in red, removed lines in blue.
        self.docx.open_text()
        if red:
            self.docx.add_text_color(text, 255,0,0)
        elif blue:
            self.docx.add_text_color(text, 0,0,255)
        else:
            self.docx.add_text(text)
        self.docx.close_text()

    def parse_line(self, text):
        # Analyze one line.
        if self.diff_command(text):
            return  # The echoed diff command is not recorded, so skip it.

        if self.only_message(text):
            return  # Handled as an "Only in" message.

        if self.filename_minus(text):
            # Handled as a "--- path1" header.
            return

        if self.filename_plus(text):
            # Handled as a "+++ path2" header.
            return

        # Anything else is written as diff body text.
        self.do_diff_text(text)

    def make_cvs(self):
        # Write the collected diff information to a CSV file.

        # Write the changed files and their changed-line counts.
        cvs_fp = open(self.cvs_name, "w")
        cvs_fp.write(u"diff path, lines, \r\n")
        for file_obj in self.file_list:
            if self.print_message:
                print "file:" , file_obj
            cvs_text =  file_obj[self.FILE_LIST_PATH_INDEX] + "," + \
                        str(file_obj[self.FILE_LIST_COUNT_INDEX]) + ",\r\n"
            cvs_fp.write(cvs_text)

        # Sort the "Only in" entries first, then write them.
        self.only_list.sort()
        cvs_fp.write(u"only path,\r\n")
        for only in self.only_list:
            if self.print_message:
                print "only:" , only
            cvs_fp.write(only + ",\r\n")
        cvs_fp.close()

    def parse(self):
        # Main diff analysis

        # Read the file line by line and analyze each line.
        diff_fp = open(self.input_diffname, "r")
        while True:
            line = diff_fp.readline()
            if len(line) <= 0:
                break
            self.parse_line(line)
        # The count for the last file is final only now, so store it.
        self.mark_diff_count()
        diff_fp.close()

        # Save the docx if docx output was requested.
        if self.output_docxname is not None:
            self.docx.save(self.output_docxname)

        # Create the CSV.
        self.make_cvs()

Sorry for the messy code, as usual. Here is a brief explanation for anyone curious enough to read it.

First, the basic parameters are set with __init__ and set_docx_param. Below is a brief description of the members of the ParseDiff class.

| Member | Contents |
|---|---|
| src_codename | Character encoding of the source files (e.g. "shift-jis") |
| input_diffname | Path of the text file containing the diff output |
| cvs_name | Path of the CSV file to output |
| file_list | List of diff entries; each entry is a pair of [path] and [number of changed lines] |
| only_list | List of file paths that exist on only one side |
| latect_diff_cnt | Changed-line count of the file currently being processed |
| output_docxname | Path of the docx file to output; if None, no docx is created |
| docx | SimpleDocxService instance |
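
To make the role of src_codename a little more concrete, here is a small sketch of what the conversion step amounts to when written for Python 3, where file contents read in binary mode are bytes and have to be decoded explicitly. The function name decode_diff_line is hypothetical, and this is not how SimpleDocxService.get_unicode_text is implemented; it only shows the general idea of decoding with the given encoding and then dropping the line ending.

```python
def decode_diff_line(raw_bytes, src_codename="shift-jis"):
    """Decode one raw diff line and strip its line ending (illustrative only)."""
    # errors="replace" keeps going when a byte sequence is not valid in the
    # given encoding, instead of raising UnicodeDecodeError.
    text = raw_bytes.decode(src_codename, errors="replace")
    return text.replace("\n", "").replace("\r", "")


raw = "+// 日本語のコメント\r\n".encode("shift-jis")
decoded = decode_diff_line(raw)
```

Whether replacing undecodable bytes is acceptable depends on the use case; for a report that is only read by people it is usually more convenient than aborting in the middle of a large diff.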

The application side is expected to set these parameters in its own code.

So I think you can follow the flow by starting from the parse function (although the code may be too messy to make that easy, sorry).
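
For readers who only want the gist of that flow, the following is a condensed sketch for Python 3 that leaves out docx output and encoding conversion entirely. The function name summarize_diff is hypothetical. Unlike the class above, it increments the count of the current file directly instead of keeping a separate counter and storing it later, so treat it as a simplified restatement rather than the same implementation.

```python
def summarize_diff(lines):
    """Return (file_list, only_list) for lines of `diff -r -u` output.

    file_list holds [path, changed_line_count] pairs; only_list holds
    "PATH FILENAME" strings taken from "Only in PATH: FILENAME" messages.
    """
    file_list = []
    only_list = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("diff -r"):
            continue  # echoed command, nothing to record
        if line.startswith("Only in "):
            path, sep, name = line[len("Only in "):].partition(": ")
            if sep:
                only_list.append(path + " " + name)
            continue
        if line.startswith("--- ") and "\t" in line:
            # A new file starts here: the path ends at the tab.
            file_list.append([line[4:line.index("\t")], 0])
            continue
        if line.startswith("+++ "):
            continue  # header of the new file, not a changed line
        if file_list and line[:1] in ("+", "-"):
            file_list[-1][1] += 1
    return file_list, sorted(only_list)


example = [
    "diff -r -u3 a/x.c b/x.c\n",
    "--- a/x.c\t2017-07-07 08:19:02.000000000 +0900\n",
    "+++ b/x.c\t2017-07-09 22:58:36.556937300 +0900\n",
    "@@ -1,3 +1,4 @@\n",
    " int a;\n",
    "-int b;\n",
    "+int b2;\n",
    "+int c;\n",
    "Only in b: new.c\n",
]
files, only = summarize_diff(example)
```

With the example input, one removed line and two added lines are attributed to a/x.c, and new.c is reported as existing only under b.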

Application

It's relatively easy because the application side just calls parse.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#
# Analyze based on the result of: diff --strip-trailing-cr -r -u3 path1 path2
#

import sys
#from docx_simple_service import SimpleDoxService
from parse_diff import ParseDiff

if __name__ == "__main__":

    if len(sys.argv) < 3:
        print "docx and csv -> make_diff_report.py diff_name csv_name docx_name"
        print "csv only     -> make_diff_report.py diff_name csv_name"
        sys.exit(1)

    docx_name = None
    if len(sys.argv) > 3:
        docx_name = sys.argv[3]

    diff = ParseDiff("shift-jis", sys.argv[1], sys.argv[2])
    image = "report_top.png"
    diff.set_docx_param(
                docx_name,        #file name
                "Courier New",      #Font name
                8,                  #font size
                u"Difference information",         #title
                image               #Opening picture
            )
    diff.parse()

    print "complete."

The arguments are set as follows.

The character encoding is fixed to shift-jis this time (sorry, most of my use cases were source code used on Windows). The image file is also fixed. If you want to change these dynamically, you could pass them as arguments or use a separate configuration file.
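
As one possible way of passing the encoding and image as arguments, here is a minimal sketch using the standard argparse module, written for Python 3. The option names --encoding and --image and the function name build_arg_parser are assumptions made for this example; the script in this article does not define them.

```python
import argparse


def build_arg_parser():
    """Build a parser for the report script (option names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Analyze the output of diff -r -u3")
    parser.add_argument("diff_name", help="text file holding the diff output")
    parser.add_argument("csv_name", help="CSV file to write")
    parser.add_argument("docx_name", nargs="?", default=None,
                        help="docx file to write (omit for CSV only)")
    parser.add_argument("--encoding", default="shift-jis",
                        help="character encoding of the compared sources")
    parser.add_argument("--image", default="report_top.png",
                        help="image placed at the top of the docx report")
    return parser


# Example: CSV-only run with a different source encoding.
args = build_arg_parser().parse_args(
    ["diff.txt", "diff.csv", "--encoding", "utf-8"])
```

The parsed values could then be handed to ParseDiff and set_docx_param in place of the fixed strings, with the defaults reproducing the current behavior.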

Actual usage

Assumptions, etc.

The recommended conditions for executing the above code are as follows.

I chose these recommended conditions because of my own situation: taking a diff of source code on Windows with Cygwin and analyzing it. Of course, you can change the file names as long as you change the import statements to match. In that case, also replace make_diff_report in the explanation below with your own file name.

You also need an image file. In this article, as in the previous one, I used the following image of Pronama-chan.

report_top.png

By the way, the Pronama-chan material was obtained from the following site, and I resized it and added text. http://pronama.azurewebsites.net/pronama/

Of course, the material comes with a license, so keep that in mind.

How to use

First, on Cygwin, save the difference information to a suitable text file by following the steps below. The export is unnecessary if the messages are already in English, and it does not need to be repeated once it has been done.


export LANG=en_US
diff -r -u3 <target folder 1> <target folder 2> > diff.txt

The difference information is now in diff.txt, so take a quick look at it with an editor and check that it contains differences like the ones shown above. If the data you want seems to be there, analyze it with the Python script.

This time, I prepared two modes: one that outputs only CSV and one that also outputs docx. This is because the docx processing takes time when the diff is large. If you only want the statistics, CSV-only processing is faster, which is why I did it this way.

If you only output CSV, run it like this.
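
If you would rather create diff.txt from Python than type the commands by hand, something along the following lines might work. It is a sketch for Python 3 that assumes a diff executable is available on the PATH; the function names build_diff_command, english_env and write_diff are hypothetical.

```python
import os
import subprocess


def build_diff_command(target1, target2, context=3):
    """Return the argument list for a recursive unified diff."""
    return ["diff", "-r", "-u{}".format(context), target1, target2]


def english_env():
    """Copy the current environment and set LANG as in the article."""
    env = dict(os.environ)
    env["LANG"] = "en_US"
    return env


def write_diff(target1, target2, out_path="diff.txt"):
    """Run diff and write its output to out_path; return the exit status."""
    with open(out_path, "wb") as out:
        # diff exits with status 1 when differences are found, so a
        # non-zero status is not treated as an error here.
        result = subprocess.run(build_diff_command(target1, target2),
                                stdout=out, env=english_env())
    return result.returncode
```

The output is written in binary mode so that the bytes of the compared source files, whatever their encoding, reach diff.txt unchanged.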

 python make_diff_report.py diff.txt diff.csv

This creates a file called diff.csv. If you open it in Excel, the statistics look like this (I tidied it up a little in Excel).

parse_diff_csv.JPG

In this way, you get a list of changed files with their changed-line counts, and a list of files that exist on only one side.
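
If you want to process the statistics further without Excel, the CSV can also be read back in Python. The sketch below, for Python 3, assumes the two-section layout written by make_cvs above: a "diff path, lines," header followed by path and count rows, then an "only path," header followed by one entry per row. The function name read_report is hypothetical.

```python
import csv
import io


def read_report(fp):
    """Return (changed, only) parsed from the CSV written by make_cvs."""
    changed = {}   # path -> number of changed lines
    only = []      # entries from the "only path" section
    section = None
    for row in csv.reader(fp):
        if not row:
            continue
        first = row[0].strip()
        if first == "diff path":
            section = "diff"
        elif first == "only path":
            section = "only"
        elif section == "diff":
            changed[first] = int(row[1])
        elif section == "only":
            only.append(first)
    return changed, only


sample_csv = ("diff path, lines, \r\n"
              "async_old/arch/async_win.c,7,\r\n"
              "only path,\r\n"
              "async_new/ tst.c,\r\n")
changed, only = read_report(io.StringIO(sample_csv))
total_changed = sum(changed.values())
```

Note that this simple reader would misparse a path that itself contains a comma, since make_cvs writes the values without quoting.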

Next, if you want to output docx as well, do as follows.

 python make_diff_report.py diff.txt diff.csv diff.docx

In addition to diff.csv, diff.docx is also generated. When opened in Word, it looks like this.

parse_diff_docx.JPG

That is how the analysis works. By the way, I usually output only the CSV statistics, output docx for just some of the files, and then process those further in Word.

License

I used the following. Thank you for providing such wonderful software.

That's all.
