In the previous post, I got to the point where I could create docx files with python-docx. This time, as an application of that, I turned the output of diff into a report. I'm publishing it here partly as a memo to myself, and I hope it helps anyone thinking about how to manage the differences between files.
The whole setup assumes that the diff is taken on Cygwin (lol), so I checked it in the following environment.
For installing python-docx and so on, the [python-docx article here](http://qiita.com/GDaigo/items/d5b46fc43c6250dd61b1#python-docx%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB) may also be helpful.
The usage of diff analyzed this time is limited to the following.
Basically, it is assumed to be run as "diff -r -u3 <target 1> <target 2>". Changing the number after -u, or comparing single files without -r, should probably work as well. Also, if diff prints its messages in Japanese, switch them to English temporarily with "export LANG=en_US".
Running it actually gives a difference like this (this is part of the OpenSSL code).
diff -r -u3 async_old/arch/async_win.c async_new/arch/async_win.c
--- async_old/arch/async_win.c 2017-07-07 08:19:02.000000000 +0900
+++ async_new/arch/async_win.c 2017-07-09 22:58:36.556937300 +0900
@@ -47,7 +47,12 @@
return 1;
}
-VOID CALLBACK async_start_func_win(PVOID unused)
+VOID CALLBACK async_start_func_win2(PVOID unused)
+{
+ async_start_func();
+}
+
+VOID CALLBACK async_start_func_win3(PVOID unused)
{
async_start_func();
}
Only in async_new/: tst.c
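The output above mixes several kinds of lines. As a rough illustration (a minimal Python 3 sketch, not the code used in this article), each line can be labeled by its prefix. The function name `classify_diff_line` and the label strings are made up for this example.

```python
def classify_diff_line(line):
    """Return a label for one line of `diff -r -u` output (illustrative only)."""
    # Order matters: "--- " and "+++ " must be tested before "-" and "+".
    if line.startswith("diff -r"):
        return "command"      # echo of the diff command for each file pair
    if line.startswith("Only in "):
        return "only"         # file that exists on one side only
    if line.startswith("--- "):
        return "old_file"     # header naming the old file
    if line.startswith("+++ "):
        return "new_file"     # header naming the new file
    if line.startswith("@@"):
        return "hunk"         # hunk header with line ranges
    if line.startswith("+"):
        return "added"
    if line.startswith("-"):
        return "removed"
    return "context"


for sample in ["--- async_old/arch/async_win.c\t2017-07-07",
               "@@ -47,7 +47,12 @@",
               "-VOID CALLBACK async_start_func_win(PVOID unused)",
               "+{",
               "Only in async_new/: tst.c"]:
    print(classify_diff_line(sample), "|", sample)
```

One caveat of a prefix-only check: a removed line whose own text starts with "-- " also begins with "--- ". The ParseDiff class shown later additionally requires a tab on the line before treating it as a file header.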
Based on the difference information in this format, the following simple analysis is performed.
I prepared and implemented the following three Python files.
Of these, the python-docx handling has already been published as an article, so I will introduce the remaining two here.
First is the code for the ParseDiff class, the de facto main part, which parses the diff file.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from docx_simple_service import SimpleDocxService

class ParseDiff:
    def __init__(self, src_codename, diff_name, cvs_name):
        self.FILE_LIST_PATH_INDEX = 0
        self.FILE_LIST_COUNT_INDEX = 1
        self.src_codename = src_codename
        self.input_diffname = diff_name
        self.cvs_name = cvs_name
        self.file_list = []
        self.only_list = []
        self.latect_diff_cnt = 0
        self.docx = None
        self.output_docxname = None
        self.print_message = True  # Set to False to suppress progress messages.

    def set_docx_param(self, docx_name, font_name, font_size, title, title_img):
        self.output_docxname = docx_name
        self.docx = SimpleDocxService()
        self.docx.set_normal_font(font_name, font_size)
        self.docx.add_head(title, 0)
        if title_img is not None:
            self.docx.add_picture(title_img, 3.0)

    def adjust_return_code(self, text):
        # A line read from the text file still ends with its line break,
        # which would add an unwanted extra break in the output, so remove it.
        text = text.replace("\n", "")
        text = text.replace("\r", "")
        return text

    def adjust_filetext(self, text):
        # Text written to Word must be unicode, so convert it here.
        # For CSV-only output the encoding does not matter, so leave it as is.
        if self.output_docxname is not None:
            text = self.docx.get_unicode_text(text, self.src_codename)
        text = self.adjust_return_code(text)
        return text

    def mark_diff_count(self):
        # Store the number of changed lines in the latest entry of file_list.
        # Changed lines are counted as they are read, so this is called when
        # processing moves on to the next file, or when all input has been
        # processed, to finalize the count.
        index = len(self.file_list) - 1
        if index >= 0:
            self.file_list[index][self.FILE_LIST_COUNT_INDEX] = self.latect_diff_cnt
        self.latect_diff_cnt = 0

    def check_word(self, text, word):
        # Return True if text starts with word.
        return text.find(word) == 0

    def diff_command(self, text):
        # Check whether the line is the echoed diff command.
        # Returns True if it is. Such lines are skipped by the caller
        # without any further processing.
        return self.check_word(text, "diff -r")

    def only_message(self, text):
        # Check whether the line starts with "Only in".
        # Returns True if the line was handled as an Only message.
        # The message has the form
        #   Only in PATH: FILENAME
        # PATH and FILENAME are extracted and added to only_list.
        ONLY_IN = "Only in "
        PATH_END = ": "
        if not self.check_word(text, ONLY_IN):
            return False
        # Extract the path string
        start = len(ONLY_IN)
        end = text.find(PATH_END, start + 1)
        if end < 0:
            return False  # Should not normally happen
        path = text[start:end]
        # Extract the file name (skip the whole ": " separator)
        start = end + len(PATH_END)
        filename = text[start:]
        filename = filename.replace("\n", "")  # Remove the line break
        # Add to only_list
        self.only_list.append(path + " " + filename)
        return True

    def filename_minus(self, text):
        # Check whether the line starts with "--- ".
        # Returns True if the line was handled as a "---" header.
        # Format example ("\t" is a tab character):
        #   --- async_old/async_err.c\t2017-07-07 08:19:02.000000000 +0900
        # First, does the line start with "--- " at all?
        MINUS_TOP_MESSAGE = "--- "
        start = text.find(MINUS_TOP_MESSAGE)
        if start != 0:
            return False
        # Find the end of the path name (see the format above)
        end = text.find("\t")
        if end < 0:
            return False
        # From here on, a "--- path" header has been found.
        # This is the start of processing for a new file, so the changed-line
        # count of the previous file is now final. Store it.
        self.mark_diff_count()
        # Add an entry for this file to the list of differing files.
        name = text[len(MINUS_TOP_MESSAGE):end]
        entry = [name, 0]
        self.file_list.append(entry)
        if self.print_message:
            print("..." + name)
        # If no docx output is requested, nothing more to do.
        if self.output_docxname is None:
            return True
        # Write the header line to the docx in blue.
        self.docx.add_head(u"――――――――――――――――――――――――――――――――――", 1)
        self.docx.open_text()
        text = self.adjust_filetext(text)
        self.docx.add_text_color(text, 0, 0, 255)
        self.docx.close_text()
        return True

    def filename_plus(self, text):
        # Check whether the line starts with "+++ ".
        # Returns True if the line was handled as a "+++" header.
        if not self.check_word(text, "+++ "):
            return False
        # If no docx output is requested, nothing more to do.
        if self.output_docxname is None:
            return True
        # Write the header line to the docx in red.
        self.docx.open_text()
        text = self.adjust_filetext(text)
        self.docx.add_text_color(text, 255, 0, 0)
        self.docx.close_text()
        return True

    def do_diff_text(self, text):
        # The body of the diff is processed here.
        # Convert the encoding if necessary; skip the line if it is empty.
        text = self.adjust_filetext(text)
        if len(text) == 0:
            return
        # If the line is a change, pick its color and count it.
        red = False
        blue = False
        if text[0] == "+":
            self.latect_diff_cnt += 1
            red = True
        elif text[0] == "-":
            blue = True
            self.latect_diff_cnt += 1
        # Without docx output we are only counting, so stop here.
        if self.output_docxname is None:
            return
        # Add the text to the docx.
        self.docx.open_text()
        if red:
            self.docx.add_text_color(text, 255, 0, 0)
        elif blue:
            self.docx.add_text_color(text, 0, 0, 255)
        else:
            self.docx.add_text(text)
        self.docx.close_text()

    def parse_line(self, text):
        # Analyze one line.
        if self.diff_command(text):
            return  # The echoed diff command is not recorded, so skip it
        if self.only_message(text):
            return  # Handled as an Only message
        if self.filename_minus(text):
            # Handled as a "--- path1" header
            return
        if self.filename_plus(text):
            # Handled as a "+++ path2" header
            return
        # Anything else is written as diff body text.
        self.do_diff_text(text)

    def make_cvs(self):
        # Write the collected information to the CSV file.
        # First the list of differing files and their changed-line counts.
        cvs_fp = open(self.cvs_name, "w")
        cvs_fp.write(u"diff path, lines, \r\n")
        for file_obj in self.file_list:
            if self.print_message:
                print("file: %s" % (file_obj,))
            cvs_text = file_obj[self.FILE_LIST_PATH_INDEX] + "," + \
                str(file_obj[self.FILE_LIST_COUNT_INDEX]) + ",\r\n"
            cvs_fp.write(cvs_text)
        # Then the Only entries, sorted before writing.
        self.only_list.sort()
        cvs_fp.write(u"only path,\r\n")
        for only in self.only_list:
            if self.print_message:
                print("only: %s" % only)
            cvs_fp.write(only + ",\r\n")
        cvs_fp.close()

    def parse(self):
        # Main diff analysis.
        # Read the file line by line and analyze each line.
        diff_fp = open(self.input_diffname, "r")
        while True:
            line = diff_fp.readline()
            if len(line) <= 0:
                break
            self.parse_line(line)
        # The count for the last file is final at this point, so store it.
        self.mark_diff_count()
        diff_fp.close()
        # Save the docx if docx output was requested.
        if self.output_docxname is not None:
            self.docx.save(self.output_docxname)
        # Create the CSV.
        self.make_cvs()
Sorry for the messy code, as usual. Here is a brief explanation for anyone who finds it hard to follow.
First, the basic parameters are set by __init__ and set_docx_param. Below is a brief description of the members of the ParseDiff class.
Member | Contents |
---|---|
src_codename | Character encoding of the source files ("shift-jis", etc.) |
input_diffname | Path of the text file containing the diff output |
cvs_name | Path of the CSV file to output |
file_list | List of difference entries; each entry is a [path, number of changed lines] pair |
only_list | List of paths of files that exist on only one side |
latect_diff_cnt | Changed-line count of the file currently being processed |
output_docxname | Path of the docx to output; if None, no docx is created |
docx | Instance of the SimpleDocxService class |
As for the class itself, the flow should be easy to follow if you start reading from the parse function (the code may be too messy for that, sorry).
The application side is expected to write code like the following. It is relatively easy, because it just calls parse.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Analyze the output of: diff --strip-trailing-cr -r -u3 path1 path2
#
import sys
from parse_diff import ParseDiff

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("With docx: make_diff_report.py diff_name csv_name docx_name")
        print("CSV only:  make_diff_report.py diff_name csv_name")
        sys.exit(1)
    docx_name = None
    if len(sys.argv) > 3:
        docx_name = sys.argv[3]
    diff = ParseDiff("shift-jis", sys.argv[1], sys.argv[2])
    image = "report_top.png"
    diff.set_docx_param(
        docx_name,                  # docx file name (None for CSV only)
        "Courier New",              # font name
        8,                          # font size
        u"Difference information",  # title
        image                       # image at the top of the report
    )
    diff.parse()
    print("complete.")
The arguments are set as follows.
The character encoding is fixed to shift-jis this time (sorry, my own use cases were mostly source code used on Windows). The image file at the top of the report is also fixed. If you want to change these dynamically, you could add them to the arguments or create a separate configuration file.
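As one possible way to pass these values as arguments, here is a minimal Python 3 sketch using the standard argparse module. The option names `--encoding` and `--image` and the function `build_arg_parser` are assumptions made for this example; they are not part of the script in this article.

```python
import argparse


def build_arg_parser():
    """Build a parser with the same positional arguments as the script,
    plus two hypothetical options for the values that are currently fixed."""
    parser = argparse.ArgumentParser(
        description="Analyze the output of diff -r -u3 and report it.")
    parser.add_argument("diff_name", help="text file containing the diff output")
    parser.add_argument("csv_name", help="CSV file to write")
    parser.add_argument("docx_name", nargs="?", default=None,
                        help="docx file to write (omit for CSV only)")
    # Hypothetical options; the defaults match the fixed values in the article.
    parser.add_argument("--encoding", default="shift-jis",
                        help="character encoding of the compared source files")
    parser.add_argument("--image", default="report_top.png",
                        help="image placed at the top of the docx")
    return parser


# Parsing an explicit list here; pass nothing to read sys.argv instead.
args = build_arg_parser().parse_args(
    ["diff.txt", "diff.csv", "--encoding", "utf-8"])
print(args.docx_name, args.encoding, args.image)
```

The parsed values could then be handed to ParseDiff and set_docx_param in place of the literals.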
The recommended conditions for running the ParseDiff class and the application script are as follows.
I chose these recommended conditions because of my own situation: taking a diff with Cygwin and analyzing source code on Windows. Of course, you can change the file names as long as you change the import statements to match. In that case, also replace make_diff_report in the explanation below with your own file name.
You also need an image file. In this article, as in the previous one, I used the following image of Pronama-chan.
By the way, the Pronama-chan material was obtained from the following site, and I resized it and added text. http://pronama.azurewebsites.net/pronama/
And of course it comes with a license, so keep that in mind.
First, on Cygwin, save the difference information to a suitable text file. Follow the steps below. The export is unnecessary if the messages are already in English, and once it has been run in a shell session it does not need to be repeated.
export LANG=en_US
diff -r -u3 <target folder 1> <target folder 2> > diff.txt
This writes the difference information to diff.txt, so take a quick look at it in an editor and check that it contains differences like the one shown earlier. If the data you want seems to be there, analyze it with the Python scripts.
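If you would rather drive this step from Python than from the shell, a minimal Python 3 sketch could look like the following. The helper names `build_diff_command`, `english_env` and `write_diff` are made up for this example, and it assumes a `diff` command is on the PATH. Note that if LC_ALL or LC_MESSAGES is set in your environment, it takes precedence over LANG.

```python
import os
import subprocess


def build_diff_command(old_dir, new_dir, context=3):
    """Return the same command line used in this article, as an argument list."""
    return ["diff", "-r", "-u{}".format(context), old_dir, new_dir]


def english_env(base_env=None):
    """Copy the environment and set LANG so that diff prints English messages."""
    env = dict(os.environ if base_env is None else base_env)
    env["LANG"] = "en_US"
    return env


def write_diff(old_dir, new_dir, out_path):
    """Run diff and save its raw output (bytes, no decoding) to out_path."""
    with open(out_path, "wb") as out:
        result = subprocess.run(build_diff_command(old_dir, new_dir),
                                stdout=out, env=english_env())
    # diff exits with 0 (no differences), 1 (differences found), 2 (trouble).
    if result.returncode > 1:
        raise RuntimeError("diff failed with exit code %d" % result.returncode)
    return result.returncode


print(build_diff_command("async_old", "async_new"))
```

Calling `write_diff("async_old", "async_new", "diff.txt")` would then produce the same kind of diff.txt as the two shell commands.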
This time, I prepared two modes: one that outputs only CSV, and one that also outputs docx. This is because docx generation takes time when the diff is large. If you just want statistics, CSV-only processing turned out to be faster, so I split it this way.
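To show why the statistics alone are cheap to compute, here is a simplified Python 3 sketch of the counting idea, with no docx handling and no encoding conversion. It is not the ParseDiff class itself; the function name `count_changed_lines` is made up for this example.

```python
def count_changed_lines(lines):
    """Return ([[path, changed_line_count], ...], [only_entries, ...])."""
    counts = []
    only = []
    for line in lines:
        if line.startswith("diff -r"):
            continue                      # echoed command, not recorded
        if line.startswith("Only in "):
            only.append(line[len("Only in "):].rstrip("\r\n"))
        elif line.startswith("--- ") and "\t" in line:
            # "--- PATH<TAB>timestamp" starts a new file entry
            counts.append([line[4:line.index("\t")], 0])
        elif line.startswith("+++ "):
            continue                      # new-file header, not a change
        elif line[:1] in ("+", "-") and counts:
            counts[-1][1] += 1            # added or removed line
    return counts, only


sample = [
    "diff -r -u3 async_old/arch/async_win.c async_new/arch/async_win.c\n",
    "--- async_old/arch/async_win.c\t2017-07-07 08:19:02.000000000 +0900\n",
    "+++ async_new/arch/async_win.c\t2017-07-09 22:58:36.556937300 +0900\n",
    "@@ -47,7 +47,12 @@\n",
    "     return 1;\n",
    " }\n",
    "-VOID CALLBACK async_start_func_win(PVOID unused)\n",
    "+VOID CALLBACK async_start_func_win2(PVOID unused)\n",
    "+{\n",
    "+    async_start_func();\n",
    "+}\n",
    "+\n",
    "+VOID CALLBACK async_start_func_win3(PVOID unused)\n",
    " {\n",
    "Only in async_new/: tst.c\n",
]
counts, only = count_changed_lines(sample)
print(counts)
print(only)
```

Each line is inspected once with a few prefix checks, so the cost is dominated by reading the file.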
To output only CSV, run it like this.
python make_diff_report.py diff.txt diff.csv
A file called diff.csv is created. Opening it in Excel shows statistics like this (I tidied it up a little in Excel).
As you can see, it contains a list of the files that differ with their number of changed lines, and a list of the files that exist on only one side.
Next, to output docx as well, run it as follows.
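If you want to process the statistics further without Excel, the CSV can also be read back with Python. The following is a minimal Python 3 sketch that assumes the layout written by make_cvs (a "diff path" section followed by an "only path" section, each row ending with a trailing comma). The function name `read_report` is made up for this example, and paths that themselves contain commas are not handled.

```python
import csv
import io


def read_report(fp):
    """Split the report into (list of (path, lines), list of only-entries)."""
    diffs, only = [], []
    section = None
    for row in csv.reader(fp):
        # Drop the empty cell produced by the trailing comma on each row.
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue
        if cells[0] == "diff path":
            section = "diff"
        elif cells[0] == "only path":
            section = "only"
        elif section == "diff":
            diffs.append((cells[0], int(cells[1])))
        elif section == "only":
            only.append(cells[0])
    return diffs, only


sample = ("diff path, lines, \r\n"
          "async_old/arch/async_win.c,7,\r\n"
          "only path,\r\n"
          "async_new/ tst.c,\r\n")
diffs, only = read_report(io.StringIO(sample, newline=""))
total = sum(lines for _, lines in diffs)
print(diffs, only, total)
```

To read a real report, open it with `open("diff.csv", newline="")` and pass the file object to `read_report`.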
python make_diff_report.py diff.txt diff.csv diff.docx
In addition to diff.csv, diff.docx is also generated. Opened in Word, it looks like this.
That is how the analysis turned out. By the way, I usually output only the CSV statistics, output docx for just some of the files, and then edit the result in Word afterwards.
I used the following software. Thank you for providing such wonderful software.
That's all.