[PYTHON] I wrote a Japanese parser in Japanese using pyparsing.

Overview

The Python library pyparsing lets you define a grammar as a hierarchical list of definitions that is easy to read and change. I noticed that such a grammar can be written without any if statements, so I wrote one using Japanese names for the pyparsing classes and methods as well as for the variables. Compared with an English description of the same content, I felt that the Japanese code was easier to understand at a glance. Translating even the classes and methods into Japanese may be overkill, but I had not expected Python to allow this much Japanese. Environment: Python 3.7, pyparsing 2.4.6 (Anaconda distribution).
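
As a minimal sketch of what "pyparsing in Japanese" looks like: Python 3 accepts non-ASCII identifiers, so imported names can be given kanji aliases with `import ... as`. The Japanese names below (列, 数字, 整数, 会員番号, 結果) are illustrative choices and not necessarily the ones used in the original script.

```python
# Python 3 allows non-ASCII identifiers (PEP 3131), so pyparsing names can be
# aliased to Japanese. All Japanese names here are illustrative assumptions.
from pyparsing import Word as 列, nums as 数字

整数 = 列(数字)               # "integer = Word(nums)"
会員番号 = 整数('会員番号')    # results name meaning "membership number"

結果 = 会員番号.parseString('3456')
print(結果['会員番号'])
```

The aliases are ordinary Python names, so the rest of pyparsing (operators such as `+` and `|`, results names, exceptions) works unchanged.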

Description and execution example

Assuming a membership roster for an organization with three types of members, I defined in Japanese the grammar of a parser that reads it. Everything other than Python reserved words uses kanji: variable names (grammar and expression names), function names, the names of imported classes and functions, and the text to be parsed. The code is a sequence of statements stacked bottom-up, read from top to bottom. The right-hand side of the last statement, shown below, is the top-level grammar, and it shows that the organization has three types of members.

association_member = supporting_member | student_member | individual_member

Although the code reads bottom-up, I decided on this last definition first, then wrote the lower-level definitions for each of its parts, working from the last statement down into the details. The most important step is branching among the three member types, so I defined expressions that match the first token of each row in the roster. In other words, parsing branches on whether the start of the input string is a company name or a school name; anything else is an individual member. (The following is an excerpt from the code below.)

company_name = backward_match('Company')
supporting_member = (company_name + representative + membership_number)('Supporting member')

school_name = backward_match('University') | backward_match('Technical college') | backward_match('Academy')
student_member = (school_name + full_name + full_name_reading + membership_number)('Student member')
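
The branching described above can be sketched as a small runnable example. It is a reduced version of the idea, not the article's script: the English identifiers, the two suffixes 会社 ("company") and 大学 ("university"), and the sample records are assumptions made for illustration.

```python
from pyparsing import Regex, Word, nums, pyparsing_unicode as uni

kanji = Word(uni.Japanese.Kanji.alphas)
kana = Word(uni.Japanese.Hiragana.alphas)
number = Word(nums)

# The first token decides the branch: a name ending in 会社 or in 大学.
company = Regex(r'.*会社')
school = Regex(r'.*大学')

supporting = (company + kanji + number)('supporting')
student = (school + kanji + kana + number)('student')
individual = (kanji + kana + number)('individual')

# '|' builds a MatchFirst, so alternatives are tried from left to right
# and the catch-all "individual" case must come last.
member = supporting | student | individual


def member_type(record):
    """Return the results name of the alternative that matched."""
    return member.parseString(record).getName()


print(member_type('架空商事株式会社 海山太郎 10'))                 # supporting
print(member_type('架空東大学 川崎三郎 かわさきさぶろう 5127'))    # student
print(member_type('山田太郎 やまだたろう 3456'))                   # individual
```

Putting `individual` first would change the result for any record whose first token is all kanji, which is why the order of the alternatives matters here.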

Company and school names are assumed to be continuous strings with no delimiter, such as '〇✖ Co., Ltd.' (in Japanese the company designation is written directly against the name, either before or after it), so they are matched with a regular-expression suffix match as shown above. At first I considered the following description, but it did not work because of the so-called greedy-matching problem: the preceding expression, the kanji string (Word), consumes everything up to and including, for example, 'Co., Ltd.', leaving nothing for the suffix to match. I also tried longest match, but that did not work either. Because pyparsing consumes tokens from the left, prefix matching has no such problem.

# A suffix placed after a Word causes the overeating problem
company_name = Join(kanji_string + oneOf(['Godo Kaisha', 'Co., Ltd.']))('company name')
# A prefix placed before a Word does not cause the overeating problem
prefixed_company_name = Join('Co., Ltd.' + kanji_string)
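
The difference between the two forms can be reproduced with a short experiment. This is a sketch with assumed names (`matches`, `word_then_suffix`, and so on) and an assumed sample company name; it uses the Japanese suffix 株式会社 because the Kanji character class only matches Japanese text.

```python
from pyparsing import Combine, Literal, ParseException, Regex, Word
from pyparsing import pyparsing_unicode as uni

kanji = Word(uni.Japanese.Kanji.alphas)
designation = Literal('株式会社')


def matches(expr, text):
    """Return True if expr parses the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


# Word takes every kanji it can, including 株式会社, and never gives any
# back, so the Literal that follows finds nothing left to match.
word_then_suffix = Combine(kanji + designation)
# A single regular expression backtracks until the suffix fits.
regex_suffix = Regex(r'.*株式会社')
# With the fixed text first, the Word can safely take the rest.
prefix_then_word = Combine(designation + kanji)

print(matches(word_then_suffix, '架空商事株式会社'))   # False
print(matches(regex_suffix, '架空商事株式会社'))       # True
print(matches(prefix_then_word, '株式会社架空商事'))   # True
```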

To keep the code in the form of a list of statements while still using regular expressions, I provided a function (a lambda expression) that returns a Regex for suffix matching. It would be more efficient to let the suffix-matching function accept several strings at once, such as both 'Godo Kaisha' and 'Co., Ltd.', but I chose a form that takes a single argument and writes out each alternative in the grammar itself. When defining a grammar with pyparsing, the overeating problem mentioned above can also be avoided, besides the regular-expression workaround, by making the character type change at the boundary so that consumption stops there, or by inserting a delimiter. Fixing such problems by trial and error without understanding the cause wastes time. (I actually lost two days that way.)
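
For comparison, the multi-argument variant mentioned above could look like the following sketch. It is not the form used in the article's script; the `*suffixes` signature and the sample records are assumptions.

```python
import re

from pyparsing import Regex


def backward_match(*suffixes):
    """Return a Regex matching text that ends with any one of the suffixes."""
    alternatives = '|'.join(re.escape(s) for s in suffixes)
    return Regex(r'.*(?:' + alternatives + ')')


# 株式会社 ("Co., Ltd.") and 合同会社 ("Godo Kaisha") handled by one expression.
company_name = backward_match('株式会社', '合同会社')

print(company_name.parseString('架空商事株式会社 海山太郎 10')[0])
print(company_name.parseString('架空企画合同会社 山田太郎 20')[0])
```

Because `.*` also matches spaces, a suffix that appears again later on the same line would stretch the match across tokens. Replacing `.*` with `\S*` would confine it to the first token, at the cost of departing from the pattern used in the article.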

The full code is below; the main part supplies test data and includes simple exception handling. Note that pe.loc does not always seem to point at the first error location, probably because of backtracking. Writing a grammar description may feel awkward at first, but after writing a few you get the knack. A useful mental image is that the string to be parsed is poured into the code that describes the grammar, and only strings that match one of the expressions filter through.

parse_OrgMemberRecordReg.py


# by T.Hayashi
# Tested with Python 3.7, pyparsing 2.4.6.
# Do not use a full-width space as a delimiter in this script.
# Names, literals, and sample records appear here in English translation.
# The original uses Japanese for all of them, and the Kanji and Hiragana
# character classes below match Japanese input only.
from pyparsing import (
    Combine as Join,
    Word as Column,
    nums as Numbers,
    __version__ as version_number,
    Regex,
    pyparsing_unicode as uni,
    ParseException)


# Everything from here on is written in Japanese in the original.
def define_grammar():
    # Suffix match: any run of characters that ends with s.
    backward_match = lambda s: Regex(r'.*' + s)

    integer = Column(Numbers)
    kanji_string = Column(uni.Japanese.Kanji.alphas)
    kana_string = Column(uni.Japanese.Hiragana.alphas)

    membership_number = integer('Membership number')

    full_name = kanji_string('First and last name')
    full_name_reading = kana_string('First and last name reading')

    # A company name either starts with 'Co., Ltd.' or ends with 'Company'.
    prefixed_company_name = Join('Co., Ltd.' + kanji_string)
    company_name = prefixed_company_name | backward_match('Company')
    representative = kanji_string('Representative')
    supporting_member = (company_name + representative + membership_number)('Supporting member')

    school_name = (backward_match('University')
                   | backward_match('Technical college')
                   | backward_match('Academy'))
    student_member = (school_name + full_name + full_name_reading + membership_number)('Student member')

    individual_member = (full_name + full_name_reading + membership_number)('Individual member')

    # Top-level grammar: alternatives are tried from left to right.
    association_member = supporting_member | student_member | individual_member
    return association_member


def test(gram, instr):
    try:
        r = gram.parseString(instr)
        name = r.getName()
        print(name, r.get(name))
        print()
    except ParseException as pe:
        print(f'error at {pe.loc} of {instr}')
        print(instr)
        # pe.loc is a character position in instr.
        print(' ' * (pe.loc - 2) + '^')
        # print('Explain:\n', ParseException.explain(pe))


print('pyparsing version:', version_number)
grammar = define_grammar()

test(grammar, 'Taro Yamada Yamada Taro 3456')
test(grammar, 'Fictitious East University Saburo Kawasaki Kawasaki Saburo 5127')
test(grammar, 'Fictitious Trading Co., Ltd. Totaro 0015')  # prefix match
test(grammar, 'Fictitious Trading Co., Ltd. Taro Kaiyama 0010')  # suffix match
test(grammar, 'North-Northwest National College of Technology Ichiro Ito Ichiro Ito 900')
# Error check: 'High School' is not defined as a school suffix.
test(grammar, 'Northeast High School Suzuki Saburo Suzuki Saburo 1000')
# Error check: the 'Co., Ltd.' prefix is incomplete.
test(grammar, 'Stock Fictitious Trading Totaro 0015')
# Error check: the reading contains kanji.
test(grammar, 'Ichitaro Yamada Ichitaro Yamada 3456')

The following is the execution result.

pyparsing version: 2.4.6
Individual member ['Yamada Taro', 'Taro Yamada', '3456']

Student member ['Fictitious East University', 'Saburo Kawasaki', 'Kawasaki Saburo', '5127']

Supporting member ['Fictitious Trading Co., Ltd.', 'Totaro', '0015']

Supporting member ['Fictitious Trading Co., Ltd.', 'Taro Umiyama', '0010']

Student member ['North-northwest technical college', 'Ichiro Ito', 'Ichiro Ito', '900']

error at 6 of Northeast High School Suzuki Saburo Suzuki Saburo 1000
Northeast High School Suzuki Saburo Suzuki Saburo 1000
    ^
error at 7 of Stock Fictitious Trading Totaro 0015
Stock Fictitious Trading Totaro 0015
     ^
error at 9 of Ichitaro Yamada Ichitaro Yamada 3456
Ichitaro Yamada Ichitaro Yamada 3456
       ^
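
The remark about pe.loc can be explored with a small experiment. When alternatives are joined with `|`, pyparsing reports the failure of the alternative that progressed furthest into the input, which may explain why the reported position is not always the first problem in the line. The grammar, names, and records below are assumptions for illustration, reduced from the article's grammar.

```python
from pyparsing import ParseException, Regex, Word, nums
from pyparsing import pyparsing_unicode as uni

kanji = Word(uni.Japanese.Kanji.alphas)
kana = Word(uni.Japanese.Hiragana.alphas)
number = Word(nums)

student = Regex(r'.*大学') + kanji + kana + number
individual = kanji + kana + number
member = student | individual


def error_position(grammar, text):
    """Return (loc, col) of the reported failure, or None if text parses."""
    try:
        grammar.parseString(text)
        return None
    except ParseException as pe:
        # loc is a 0-based offset into the input; col is the 1-based column.
        return pe.loc, pe.col


# Both alternatives fail: "student" at offset 0, "individual" at the
# reading written in kanji (offset 5). The later position is reported.
print(error_position(member, '山田太郎 山田太郎 3456'))             # (5, 6)
# Here "student" gets as far as offset 11, so that position is reported.
print(error_position(member, '架空東大学 川崎三郎 川崎三郎 5127'))   # (11, 12)
```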

Conclusion

With a list of BNF-like (Backus-Naur form) definitions, I was able to define a grammar that is easier to understand than an ordinary program. I expected all sorts of surprises from using Japanese, but there were none, and I was surprised at how much Python allows. The things I took care over were keeping the overeating problem out of the grammar and not typing invisible full-width spaces into the code. Defining and debugging a larger grammar so that it works correctly can be hard to trace and, depending on how it is written, may not be as easy as an ordinary Python program. For that reason, anyone used to a cycle of writing code and then fixing it by watching for exceptions and unexpected behavior needs to make a much stronger effort than usual to write correct code from the start. Note: "overeating problem" is not an established term; it is a name I made up for this article.
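
One way to guard against stray full-width spaces is to scan the source text for U+3000 before running it. The helper below is a minimal sketch; the function name and the sample text are assumptions, not part of the original script.

```python
FULL_WIDTH_SPACE = '\u3000'  # ideographic (full-width) space


def find_full_width_spaces(text):
    """Return 1-based (line, column) pairs for each full-width space in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char == FULL_WIDTH_SPACE:
                hits.append((line_number, column))
    return hits


# The second line hides a full-width space before the '=' sign.
sample = '整数 = 列(数字)\n会員番号\u3000= 整数\n'
print(find_full_width_spaces(sample))  # [(2, 5)]
```

In practice the text would come from reading the script file with `open(path, encoding='utf-8').read()`.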
