[PYTHON] I wrote a Japanese parser in Japanese using pyparsing.

Overview

The Python library pyparsing lets you define a grammar as a hierarchical list of definitions that is easy to read and change. Noticing that such a grammar can be written without any if statements, I wrote one using Japanese for the pyparsing class names, method names, and variable names. Compared with an English description of the same content, I felt the Japanese code was easier to understand at a glance. Translating even the classes and methods into Japanese may be overkill, but I had not expected Python to allow this much Japanese. Environment: Python 3.7 and pyparsing 2.4.6 (Anaconda distribution).

Description and execution example

Assuming a membership roster for an association with three types of member, I defined in Japanese the grammar of a parser that reads it. Apart from Python reserved words, everything is in Japanese: variable names (grammar and expression names), function names, the names of imported classes and functions, and the text being parsed. The code is a sequence of statements stacked bottom-up and read from top to bottom. The right-hand side of the following statement, which is written last, is the top-level grammar, and it shows that the association has three types of member.

Association_member = Supporting_member | Student_member | Individual_member

Although the code reads bottom-up, I actually wrote it top-down: I decided this last definition first and then worked down from it, writing the lower-level definition for each of its parts. To branch between the three types of member, the key step is to define expressions that match the first token of each roster row. In other words, the parser branches on whether the start of the string being parsed is a company name or a school name; anything else is an individual member. (The code below is an excerpt.)

company_name = Backward_match('Company')
Supporting_member = (company_name + Representative + Membership_number)('Supporting member')

school_name = Backward_match('University') | Backward_match('Technical college') | Backward_match('University校')
Student_member = (school_name + First_and_last_name + First_and_last_name_reading + Membership_number)('Student member')
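How the first token selects the branch can be sketched in a reduced form. This is only an illustration under assumptions: the Japanese sample rows and the suffix '会社' are stand-ins chosen for the sketch, not the author's data, and the grammar is cut down to two member types with English names.

```python
from pyparsing import Regex, Word, nums
from pyparsing import pyparsing_unicode as uni

# Basic building blocks (names are illustrative, not the article's).
kanji = Word(uni.Japanese.Kanji.alphas)
kana = Word(uni.Japanese.Hiragana.alphas)
number = Word(nums)

# A row is a supporting member when it starts with text ending in the
# assumed company suffix; otherwise the next alternative is tried.
company = Regex(r'.*会社')
supporting = (company + kanji + number)('supporting')
individual = (kanji + kana + number)('individual')

# '|' tries the alternatives from left to right and keeps the first match.
member = supporting | individual

first = member.parseString('架空商事株式会社 海山太郎 0010')
second = member.parseString('山田太郎 やまだたろう 3456')

# The results name records which branch matched.
print('supporting' in first, list(first))
print('individual' in second, list(second))
```

Because the alternatives are tried in order, the catch-all individual pattern is placed last.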

Company names and school names such as 〇✖ Co., Ltd. are continuous character strings with no delimiter before the suffix, so they are matched by a regular-expression suffix match, as shown above. At first I considered the description below, but it did not work because of what I call the overeating problem (greedy consumption): the preceding expression, the Kanji string (Word), consumes everything up to and including, for example, 'Co., Ltd.'. I also tried the longest match, but that did not work either. In pyparsing, tokens are consumed from the left, so prefix matching causes no such problem.

"""Overeating problem: the Kanji string swallows the suffix"""
company_name = Join(Kanji_string + oneOf(['Godo Kaisha', 'Co., Ltd.']))('company name')
"""A prefix does not cause the overeating problem"""
Company_name = Join('Co., Ltd.' + Kanji_string)
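The difference between the two approaches can be reproduced in a few lines. The following is a minimal sketch, assuming the Japanese suffix '株式会社' and a made-up company name as sample input; the helper `matches` is hypothetical and only exists for this illustration.

```python
from pyparsing import Combine, Literal, ParseException, Regex, Word
from pyparsing import pyparsing_unicode as uni

kanji = Word(uni.Japanese.Kanji.alphas)   # one or more Kanji characters

# Suffix written as a separate literal: Word consumes the whole Kanji run,
# suffix included, so nothing is left for the literal to match.
greedy = Combine(kanji + Literal('株式会社'))

# Suffix folded into a single regular expression, which backtracks internally.
suffix = Regex(r'.*株式会社')

def matches(expr, text):
    """Return True if expr parses text from the start (hypothetical helper)."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False

sample = '架空商事株式会社'
greedy_ok = matches(greedy, sample)
suffix_ok = matches(suffix, sample)
print(greedy_ok, suffix_ok)   # prints: False True
```

pyparsing does not go back into an element that has already matched, which is why the Word-based version fails while the single Regex succeeds.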

To keep the grammar in the form of a list of statements while still using regular expressions, I provide a function (a lambda expression) that returns a suffix-matching Regex. It would be more efficient to let this function accept several strings at once, such as 'Godo Kaisha' and 'Co., Ltd.', but I kept it to a single argument so that each alternative is written out in the grammar description. When defining a grammar with pyparsing, the overeating problem mentioned earlier can be avoided not only with the regular-expression workaround but also by stopping consumption in other ways, for example by changing the character type of the input before and after the boundary or by inserting a delimiter. Ad hoc fixes waste time. (In fact, I lost two days that way.)
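The multi-string variant mentioned above might look like the following. This is a sketch only: `suffix_match` is a hypothetical name, the two Japanese suffixes are assumed examples, and the article's own code keeps the single-argument lambda.

```python
import re
from pyparsing import Regex

def suffix_match(*suffixes):
    """Build one Regex that matches text ending in any of the given suffixes.

    Hypothetical variant of the single-argument suffix function.
    """
    # Escape each suffix so that characters such as '.' are taken literally.
    alternatives = '|'.join(re.escape(s) for s in suffixes)
    return Regex('.*(?:' + alternatives + ')')

company = suffix_match('株式会社', '合同会社')
tokens_a = company.parseString('架空商事株式会社 海山太郎 0010').asList()
tokens_b = company.parseString('架空商事合同会社 海山太郎 0010').asList()
print(tokens_a, tokens_b)
```

One caveat of this kind of pattern is that the greedy `.*` runs to the last occurrence of a suffix in the input, so a later field that happens to contain the suffix would be absorbed as well.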

Below is the whole code, including test data and simple exception handling in the main part. Note that pe.loc does not always seem to point at the first error position, probably because of backtracking. Writing a grammar description may not come easily at first, but after writing a few you get the knack. The mental image is that the string to be parsed is poured into the code that describes the grammar, and only the parts that match one of the expressions are filtered out.

`parse_OrgMemberRecordReg.py`


#by T.Hayashi
#tested with Python3.7, pyparsing 2.4.6
#don't use full-width space as delimiter in this script.
from pyparsing import (
       Combine as Join,
       Word as Column,
       nums as Numbers,
       __version__ as Version_number,
       Regex,
       pyparsing_unicode as uni,
       ParseException)

#The following is written in Japanese
def define_grammar():

    #suffix match: any characters followed by the string s
    Backward_match = lambda s : Regex(r'.*'+s)

    #basic character-string expressions
    integer = Column(Numbers)
    Kanji_string = Column(uni.Japanese.Kanji.alphas)
    Kana_line = Column(uni.Japanese.Hiragana.alphas)

    Membership_number = integer('Membership number')

    First_and_last_name = Kanji_string('First and last name')
    First_and_last_name_reading = Kana_line('First and last name reading')

    #company name: try the prefix form first, then the suffix match
    Company_name = Join('Co., Ltd.' + Kanji_string)
    company_name = Company_name | Backward_match('Company')
    Representative = Kanji_string('Representative')
    Supporting_member = (company_name + Representative + Membership_number)('Supporting member')

    school_name = Backward_match('University') | Backward_match('Technical college') | Backward_match('University校')
    Student_member = (school_name + First_and_last_name + First_and_last_name_reading + Membership_number)('Student member')

    Individual_member = (First_and_last_name + First_and_last_name_reading + Membership_number)('Individual member')

    #top-level grammar: one of the three member types
    Association_member = Supporting_member | Student_member | Individual_member
    return Association_member

def test(gram,instr):
    try:
        r=gram.parseString(instr)
        #the results name tells which member type matched
        name=r.getName()
        print(name,r.get(name))
        print()
    except ParseException as pe:
        print(f'error at {pe.loc} of {instr}')
        print(instr)
        #loc : char position.
        print('　'*(pe.loc-2)+'^')
        #print('Explain:\n',ParseException.explain(pe))


print('pyparsing version:',Version_number)
grammar=define_grammar()

test(grammar,'Taro Yamada Yamada Taro 3456')
test(grammar,'Fictitious East University Saburo Kawasaki Kawasaki Saburo 5127')
test(grammar,'Fictitious Trading Co., Ltd. Totaro 0015') #Prefix match
test(grammar,'Fictitious Trading Co., Ltd. Taro Kaiyama 0010') #Backward match
test(grammar,'North-Northwest National College of Technology Ichiro Ito Ichiro Ito 900')
#Error check: high school is not in the definition
test(grammar,'Northeast High School Suzuki Saburo Suzuki Saburo 1000')
#Error check: 'Company' is missing
test(grammar,'Stock Fictitious Trading Totaro 0015')
#Error check: the reading is written in Kanji
test(grammar,'Ichitaro Yamada Ichitaro Yamada 3456')
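The caret line in test pads with full-width spaces, which lines up only when the characters before the error position are themselves full-width. A width-aware alternative could be sketched as follows, assuming a hypothetical helper `caret_line` that is not part of the script above; it uses only the standard library, and the sample row and position are made up for the illustration.

```python
import unicodedata

def caret_line(text, loc):
    """Return a marker line with '^' placed under text[loc] (hypothetical helper)."""
    pad = []
    for ch in text[:loc]:
        # Wide (W) and full-width (F) characters take two columns in a
        # monospaced East Asian font, so pad them with a full-width space.
        wide = unicodedata.east_asian_width(ch) in ('W', 'F')
        pad.append('\u3000' if wide else ' ')
    return ''.join(pad) + '^'

sample = '東北高校 鈴木三郎 すずきさぶろう 1000'
marker = caret_line(sample, 5)   # position 5 is an arbitrary example
print(sample)
print(marker)
```

Whether the caret really lines up still depends on the font used by the terminal.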

The following is the execution result.

pyparsing version: 2.4.6
Individual member ['Yamada Taro', 'Taro Yamada', '3456']

Student member ['Fictitious East University', 'Saburo Kawasaki', 'Kawasaki Saburo', '5127']

Supporting member ['Fictitious Trading Co., Ltd.', 'Totaro', '0015']

Supporting member ['Fictitious Trading Co., Ltd.', 'Taro Umiyama', '0010']

Student member ['North-northwest technical college', 'Ichiro Ito', 'Ichiro Ito', '900']

error at 6 of Northeast High School Suzuki Saburo Suzuki Saburo 1000
Northeast High School Suzuki Saburo Suzuki Saburo 1000
　　　　^
error at 7 of Stock Fictitious Trading Totaro 0015
Stock Fictitious Trading Totaro 0015
　　　　　^
error at 9 of Ichitaro Yamada Ichitaro Yamada 3456
Ichitaro Yamada Ichitaro Yamada 3456
　　　　　　　^

Conclusion

With a list of BNF (Backus-Naur form)-like definitions, I was able to define a grammar that is easier to understand than an ordinary program. I expected all sorts of trouble from using Japanese, but none occurred, and I was surprised that Python could handle this much. The two things I was careful about were keeping the overeating problem out of the grammar and not typing invisible full-width spaces into the code. Defining and debugging a larger grammar so that it works correctly can be hard to trace and, depending on how it is written, may not be as easy as with an ordinary Python program. For this reason, if your habit is to write, run, and fix repeatedly while watching exceptions and unexpected behavior, you need to try much harder than with an ordinary program to write correct code from the start.

Note: I call this the "overeating problem", but that is not an established term; I coined it provisionally for this article.
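One way to guard against the invisible full-width spaces mentioned above is to scan the source text for U+3000 before running it. The following is a minimal sketch using only the standard library; the function name `find_full_width_spaces` and the sample snippet are assumptions made for this illustration.

```python
FULL_WIDTH_SPACE = '\u3000'

def find_full_width_spaces(source):
    """Return 1-based (line, column) pairs for every U+3000 in source."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == FULL_WIDTH_SPACE:
                hits.append((line_no, col))
    return hits

# The second line hides a full-width space after the '='.
snippet = "a = 1\nb =\u30002\n"
hits = find_full_width_spaces(snippet)
print(hits)   # prints: [(2, 4)]
```

In practice the same function could be applied to the contents of a script file read with `open(path, encoding='utf-8')`. Full-width spaces inside string literals are reported too, so the hits need a quick manual check.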