Read an old Gakushin DC application Word file (.doc) from Python and try manipulating it.

motivation

I heard from a university administrative staff member that checking Kakenhi (Grant-in-Aid for Scientific Research) application forms is hard work. Apparently each form is currently marked up in red by hand, one item at a time.

- Is the form filled out according to the application guidelines?
- Are the figures and references properly included?
- Is the list of achievements in the proper format?

It would be nice to have an automatic check tool, and I hope I can build one with Python. Building it right away would be too heavy a task, though, so I first tried reading and writing Word files from Python. As a sample resembling a Kakenhi application, I read the Japan Society for the Promotion of Science (JSPS, "Gakushin") application form that I submitted in June 2011. If you are not familiar with Gakushin, ask a doctoral student near you; you may get an emotional response.

How to read a Word file from Python?

Read from a word file in python

The main options are python-docx and docx2txt, both of which support only .docx files. As we will see later, a .doc file cannot be read directly, so I used antiword for it (antiword extracts the text of the .doc file rather than producing a real .docx). Since docx2txt can also read text from headers, footers, and hyperlinks, I mainly used docx2txt this time.
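
As background for why these libraries handle only .docx: a .docx file is a ZIP archive whose main text lives in word/document.xml, whereas the older .doc is a binary format. The following is a minimal standard-library sketch of that idea, not how python-docx or docx2txt are actually implemented. The names docx_paragraphs and make_tiny_docx are hypothetical, and make_tiny_docx builds only the one XML part needed for the demonstration, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_URI = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W_NS = '{' + W_URI + '}'


def docx_paragraphs(source):
    """Return the text of each paragraph in the main body of a .docx file.

    `source` may be a file path or a file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read('word/document.xml'))
    paragraphs = []
    for p in root.iter(W_NS + 'p'):
        # a paragraph's text is spread over one or more <w:t> elements
        paragraphs.append(''.join(t.text or '' for t in p.iter(W_NS + 't')))
    return paragraphs


def make_tiny_docx(texts):
    """Hypothetical helper: an in-memory archive holding only word/document.xml."""
    body = ''.join(f'<w:p><w:r><w:t>{t}</w:t></w:r></w:p>' for t in texts)
    xml = f'<w:document xmlns:w="{W_URI}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('word/document.xml', xml)
    buf.seek(0)
    return buf


print(docx_paragraphs(make_tiny_docx(['background', 'problem'])))
```

A .doc file has no such archive structure, so zipfile.ZipFile would reject it, which is why a separate tool is needed for it.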

environment

Install python-docx

bash


pip install python-docx

It seems that python-docx officially supports only up to Python 3.4, but it worked with Python 3.7. I couldn't get Python 3.4 in Anaconda, so I stayed on 3.7.

Install docx2txt

bash


pip install docx2txt

antiword installation

As I'll explain later, the Word file I wanted to read as a sample was in .doc format rather than .docx, and python-docx cannot open .doc files. Opening it in Word and saving it as .docx felt like admitting defeat, so I tried antiword instead.

Install with apt-get: failed

In conclusion, I couldn't install antiword with apt-get on the Mac. Assuming antiword could be installed with apt-get, I installed fink (/ yu-sa / items / 351969b281f3aea5e03d) to get apt-get, but during the fink installation I was told there was no JDK and was sent to a download page (of course, installing Flash Player didn't help).

bash


sudo apt-get antiword

Output result


E: Invalid operation antiword

Install with brew: Success

bash


brew install antiword

Following an article I found, I ran the brew command above and antiword was installed successfully.

bash


(base) akpro:~ kageazusa$ antiword
	Name: antiword
	Purpose: Display MS-Word files
	Author: (C) 1998-2005 Adri van Os
	Version: 0.37  (21 Oct 2005)
	Status: GNU General Public License
	Usage: antiword [switches] wordfile1 [wordfile2 ...]
	Switches: [-f|-t|-a papersize|-p papersize|-x dtd][-m mapping][-w #][-i #][-Ls]
		-f formatted text output
		-t text output (default)
		-a <paper size name> Adobe PDF output
		-p <paper size name> PostScript output
		   paper size like: a4, letter or legal
		-x <dtd> XML output
		   like: db (DocBook)
		-m <mapping> character mapping file
		-w <width> in characters of text output
		-i <level> image level (PostScript only)
		-L use landscape mode (PostScript only)
		-r Show removed text
		-s Show hidden (by Word) text

It's a 2005 tool!

Read / write test with python-docx

When I copied and pasted the code from the latter half of an article I found, it worked, and I was able to create and read a Word file. There seems to be no problem with Python 3.7. However, when I copied and ran the code from the comment section of another article, docx_simple_service could not be imported. I'm guessing this is due to the Python version.

error


ModuleNotFoundError: No module named 'docx_simple_service'

Convert the .doc file with antiword and read .docx files with docx2txt

I will finally read the sample.

sample

I use an application form like the one below. It dates from the era when table borders (ruled lines), which have recently disappeared from the form, were still in use. Since it is a .doc file, it cannot be read from Python as is. I couldn't find the final version of the Word file, so I use a slightly earlier version that I had emailed to the office for checking. (Screenshot: the sample application form.)

Read .doc file

I could read it right away with the function from an answer I found; I only changed how the path is specified. For a .doc file, the function has antiword write the extracted text to a temporary file named with the .docx extension, reads that file, and deletes it immediately afterwards.

python


import os
import subprocess

import docx2txt


def get_doc_text(filepath, file):
    """Return the text of the .docx or .doc file named `file` inside `filepath`."""
    path = os.path.join(filepath, file)

    if file.endswith('.docx'):
        return docx2txt.process(path)

    if file.endswith('.doc'):
        # antiword prints the plain text of the .doc file; as in the original
        # answer, it is written to a temporary file named like the .docx
        docx_file = path + 'x'

        if os.path.exists(docx_file):
            # a file with the .docx name already exists, which means it is a
            # different file, so don't overwrite it
            print('Info: a .docx file with the same name already exists, so the .doc file was not read')
            return ''

        with open(docx_file, 'w') as f:
            # passing the arguments as a list avoids shell quoting problems
            subprocess.run(['antiword', path], stdout=f)
        with open(docx_file) as f:
            text = f.read()
        os.remove(docx_file)  # the temporary file was only needed for reading
        return text

    # unsupported extension
    return ''

I was able to read it! (Screenshot: the text that was read.)
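
As a variation on the function above, the temporary file can be skipped by capturing antiword's standard output directly. This is only a sketch under a few assumptions: the function name doc_to_text and its antiword parameter are hypothetical, antiword is expected to be on the PATH, and its output is assumed to be UTF-8 (depending on the environment, a different encoding or antiword's -m mapping switch may be needed).

```python
import shutil
import subprocess


def doc_to_text(doc_path, antiword='antiword'):
    """Return the text that antiword extracts from a .doc file (hypothetical helper)."""
    exe = shutil.which(antiword)
    if exe is None:
        # fail early with a clear message if the command is not installed
        raise FileNotFoundError(f'{antiword} was not found on PATH')
    # capture stdout instead of redirecting it to a temporary file
    result = subprocess.run([exe, doc_path], capture_output=True, check=True)
    # assumption: the output is UTF-8; undecodable bytes are replaced
    return result.stdout.decode('utf-8', errors='replace')
```

With antiword installed, doc_to_text('./sample/110624GakushinDraftKAGE2.1.doc') would play the same role as the .doc branch of get_doc_text.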

Try to extract the contents of each subheading

I would like to use the text I read to extract the content under each subheading. In this example, the subheadings are enclosed in 【】, for example 【problem】.

Text formatting

Remove line breaks, the | characters left over from table borders, and full-width spaces (\u3000).

python


gakushin = get_doc_text('./sample', '110624GakushinDraftKAGE2.1.doc')
gakushin = gakushin.replace('\n', '').replace('|', '').replace('\u3000', '')

(Screenshot: the text after removing line breaks.) Runs of consecutive spaces still remain, so I removed them with a regular expression, following a reference I found.

python


import re

# Collapse runs of half-width or full-width spaces into one half-width space
gakushin = re.sub(r'[ \u3000]+', ' ', gakushin)

There are places where I want a space to remain, such as between "et al." and the year, so for now I left a single half-width space. Strictly speaking, it would be best to remove all spaces and then restore them only around "et al." or "&". (Screenshot: the text after collapsing spaces.)
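
One possible way to keep the spaces that matter is sketched below. Note that this is a different technique from the one suggested above: instead of removing every space and restoring some, it removes a space only when it touches a non-ASCII (for example Japanese) character, so spaces between ASCII words such as "et al. 2010" survive. The function name tidy_spaces and the sample string are made up for illustration.

```python
import re

# any character outside the ASCII range (Japanese text, full-width symbols, ...)
NON_ASCII = r'[^\x00-\x7f]'


def tidy_spaces(text):
    """Collapse runs of spaces, then drop spaces adjacent to non-ASCII characters."""
    # 1) runs of half-width / full-width spaces become one half-width space
    text = re.sub(r'[ \u3000]+', ' ', text)
    # 2) remove a space that follows or precedes a non-ASCII character
    return re.sub(rf'(?<={NON_ASCII}) | (?={NON_ASCII})', '', text)


sample = '研究 の 背景　Smith et al.  2010 に よる'
print(tidy_spaces(sample))  # 研究の背景Smith et al. 2010による
```

A space between a Japanese word and an English word is also removed by this rule, so it would need adjusting if that boundary should keep a space.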

Extraction of subheadings

Search for the parts enclosed in 【】. I'm not good at regular expressions, so I looked up a reference, and it worked.

python


re.findall(r'【.+?】', gakushin)

Output result


['【background】',
 '【problem】',
 '【Solutions, research objectives, research methods, features and original points】',
 '【Research progress 1】',
 '【Research progress 2】',
 '【Background of future research plans】',
 '【Problems / Points to be solved】',
 '【How did you come up with the idea】',
 '【2-1】',
 '【2-2】',
 '【Refereed】',
 '【No oral presentation / peer review】',
 '【Poster presentation / peer review】',
 '【Motivation for aspiring to a research position】',
 '【Aiming researcher image】',
 '【Self-advantages, etc.】',
 '【Especially excellent academic performance and awards】',
 '【Characteristic extracurricular activities】']

The subheadings have been extracted!
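
The `?` in the pattern matters: it makes `.+` non-greedy, so each match stops at the nearest closing bracket. A small illustration on a made-up string (the sample text is not from the application form):

```python
import re

sample = '【背景】本文A【課題】本文B'

# non-greedy: each match ends at the first 】 that follows
lazy = re.findall(r'【.+?】', sample)
# greedy: a single match runs from the first 【 to the last 】
greedy = re.findall(r'【.+】', sample)

print(lazy)    # ['【背景】', '【課題】']
print(greedy)  # ['【背景】本文A【課題】']
```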

Extraction of sentences under subheadings

Let's store the subheadings in a variable and use the subheadings themselves to split the text gakushin.

python


subhead = re.findall(r'【.+?】', gakushin)
text = gakushin
split_result = []

for head in subhead:
    # split only at the first occurrence so a repeated subheading does not drop text
    before, text = text.split(head, 1)
    split_result.append(before)

# whatever follows the last subheading
split_result.append(text)

(Screenshot: the list of split text.) I was able to split the text at the subheadings and put the pieces in a list. Let's check the number of elements.

python


print('Number of subheading elements', len(subhead))
print('Number of elements in the divided sentence', len(split_result))

Output result


Number of subheading elements 18
Number of elements in the divided sentence 19

The number of subheadings + 1 equals the number of text pieces split by the subheadings, so the counts are consistent. Next, store them in a pandas DataFrame so that each subheading is paired with the text below it. The first element of the list split_result (the text before the first subheading) is discarded.

python


import pandas as pd

# split_result[0] is the text before the first subheading, so skip it
df = pd.DataFrame({'subhead': subhead, 'text': split_result[1:]})

(Screenshot: the resulting DataFrame.)

Subheadings and the text below them have been associated. Let's count the number of characters and put it in the data frame.

python


df['length'] = df.text.apply(len)

(Screenshot: the DataFrame with the length column.) Item 【2-2】 seems to be particularly long. From this alone, it is not clear what 【2-2】 refers to. It looks like a research plan, but why there is no 【1】 is unknown.
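
The splitting, pairing, and counting steps above can also be sketched in one pass with re.split. When the pattern contains a capturing group, re.split keeps the matched subheadings in the result, so headings and bodies alternate in the returned list. The function name split_by_subheadings and the sample string are made up for illustration, and pandas is left out to keep the sketch standard-library only.

```python
import re


def split_by_subheadings(text):
    """Return (subheading, body, body length) tuples for 【...】 subheadings."""
    # with a capturing group the result is [preamble, head1, body1, head2, body2, ...]
    parts = re.split(r'(【.+?】)', text)
    heads = parts[1::2]   # the captured subheadings
    bodies = parts[2::2]  # the text following each subheading
    return [(head, body, len(body)) for head, body in zip(heads, bodies)]


sample = '前文【背景】本文A【課題】本文BB'
rows = split_by_subheadings(sample)
for head, body, length in rows:
    print(head, length)
```

The preamble before the first subheading (parts[0]) is skipped here, just as split_result[0] is discarded in the DataFrame version.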

Summary

I was able to read the .doc file from Python and manipulate the text. I would like to try various things in the future.

reference

-Research Fellow | Japan Society for the Promotion of Science
