[PYTHON] Learn Pyparsing

What is Pyparsing?

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. The pyparsing module provides a library of classes that client code uses to construct the grammar directly in Python code. (pyparsing - home)

That is how the project describes itself. O'Reilly also publishes a book on it, so those who can afford it may want to buy it. I looked into pyparsing and wrote this up because there seems to be little information in Japanese. I am trying it in a Python 2.7 environment, so some of this may be out of date. The pyparsing version used is 2.0.2.

How to use

that's all.

Make a parser

There are two ways to make a parser. One is to instantiate a class derived from ParserElement; the other is to call one of the module's generator (helper) functions, which builds a suitable ParserElement for you.

import pyparsing as pp

word = pp.Word(pp.alphanums)
comma = pp.Literal(',')
csv = word + pp.ZeroOrMore(comma + word)
# >>> csv.parseString("1,2,3,a,5")
# (['1', ',', '2', ',', '3', ',', 'a', ',', '5'], {})

Some derived classes take a string or regular expression as the constructor argument, and some take other pyparsing elements as arguments.

- Take a string (or no argument): Word, Literal, etc.
- Take parser elements: OneOrMore, NotAny, SkipTo, etc.

Operators are provided for the classes that take a parser element or a list of them. Repetition has no corresponding class, but it can be expressed with *. The two operands of * may be written in either order, but one of them must be an int of 0 or more (long is not accepted).

pp.And([p1, p2])        # == p1 + p2         ; ordered sequence
pp.Each([p1, p2])       # == p1 & p2         ; unordered sequence (any order)
pp.MatchFirst([p1, p2]) # == p1 | p2         ; first match wins
pp.Or([p1, p2])         # == p1 ^ p2         ; longest match wins
pp.NotAny(p)            # == ~p              ; negation
p + p + p               # == p * 3 or 3 * p  ; shorthand for repetition
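The operators in the table can be tried out with a minimal sketch like the following. The element names (`a`, `b`, `num`, `not_x`) and the input strings are made up for illustration, and the camelCase method names follow the pyparsing 2.x style used in this article.

```python
import pyparsing as pp

a = pp.Literal("ab")
b = pp.Literal("abc")
num = pp.Word(pp.nums)

# MatchFirst (|) takes the first alternative that matches, even if a later one is longer.
first = list((a | b).parseString("abc"))

# Or (^) tries every alternative and keeps the longest match.
longest = list((a ^ b).parseString("abc"))

# Each (&) accepts its elements in any order.
unordered = list((pp.Literal("x") & pp.Literal("y")).parseString("y x"))

# Multiplying by an int repeats the element that many times (same as num + num + num).
triple = list((num * 3).parseString("1 2 3"))

# NotAny (~) consumes nothing and fails if the element would match here.
not_x = ~pp.Literal("x") + pp.Word(pp.alphas)
allowed = list(not_x.parseString("abc"))
try:
    not_x.parseString("xyz")
    rejected = False
except pp.ParseException:
    rejected = True

print(first, longest, unordered, triple, allowed, rejected)
```

The difference between | and ^ only shows up when one alternative is a prefix of another, which is why the sketch uses "ab" and "abc".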

The naming may not be the most intuitive, but you have to get used to it. Also note that Each, despite being unordered, tries its elements in order from the first without backtracking, so if the elements overlap in what they accept, the input cannot be consumed and an exception is raised. If you really need a choice among alternatives, use MatchFirst where appropriate.

The generator functions are specialized for specific purposes and are built internally on regular expressions. If you find yourself relying on them everywhere, the design of the parser is probably wrong. I will omit them here.

Example

Let's parse the $GPGSV sentence of the NMEA 0183 output from a GPS receiver. $GPGSV has the following structure.

$GPGSV,3,2,12,16,02,229,,22,21,224,16,24,02,095,,25,52,039,35*73\r\n

It starts with the start character $ and ends with the terminator sequence \r\n. A checksum of the form *hh (two hexadecimal digits) is inserted before the terminator. The rest consists of 7 to 16 comma-separated values, some of which may be empty. The number of values is not stated directly in the sentence. Because the data is split across several sentences and the amount of information being sent is included, the count could be derived by calculation, but this article does not go that far. (The second field is the total number of messages, the third is the current message number, and the fourth is the amount of information, i.e. the number of satellites.)

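from functools import reduce  # builtin in Python 2, must be imported in Python 3

buf = '$GPGSV,3,2,12,16,02,229,,22,21,224,16,24,02,095,,25,52,039,35*73\r\n'
# checksum (for validation): XOR of every character between '$' and '*'
ck = "{:02X}".format(reduce(lambda a,b: a^b, [ord(_) for _ in buf[1+buf.find("$"):buf.rfind("*")]]))
# simple parser elements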
toint  = lambda a: int(a[0])
c      = pp.Literal(',').suppress()
nemo   = pp.Literal('$GPGSV')
numMsg = pp.Word(pp.nums).setParseAction(toint)
msgNum = pp.Word(pp.nums).setParseAction(toint).setResultsName("msgNum")
numSV  = pp.Word(pp.nums).setParseAction(toint).setResultsName("numSV")
sv     = pp.Word(pp.nums).setParseAction(toint)
# combined parser elements
toint_maybe = lambda a: int(a[0]) if a[0] else -1
elv    = pp.Combine(pp.Word(pp.nums) ^ pp.Empty()).setParseAction(toint_maybe)
az     = pp.Combine(pp.Word(pp.nums) ^ pp.Empty()).setParseAction(toint_maybe)
cno    = pp.Combine(pp.Word(pp.nums) ^ pp.Empty()).setParseAction(toint_maybe)
cs     = pp.Combine(pp.Literal('*') + pp.Literal(ck)).suppress()
block  = pp.Group(c + sv + c + elv + c + az + c + cno)
blocks = pp.OneOrMore(block)
parser = nemo + c + numMsg + c + msgNum + c + numSV + blocks + cs
# result
ret = parser.parseString(buf)

The one thing to remember is that Word("ab...") matches a run of one or more characters taken from the given set, roughly Combine(OneOrMore(Literal('a') | Literal('b') | ...)). Literal, on the other hand, matches one fixed string of length 1 or more, so it is best to consume fixed strings with Literal and values such as numbers with Word.
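As a small illustration of the difference between the two classes, the sketch below feeds the same made-up input to a Word and to a Literal built from the same string.

```python
import pyparsing as pp

text = "abbac"

# Word treats its argument as a set of allowed characters and
# consumes the longest run made of them.
word_tokens = list(pp.Word("ab").parseString(text))

# Literal treats its argument as one fixed string.
literal_tokens = list(pp.Literal("ab").parseString(text))

print(word_tokens, literal_tokens)
```

Both parsers stop before the unmatched remainder because parseString does not require the whole input to be consumed by default.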

The parsed result is returned in ret. Here, the fields that can be converted to integers are converted with setParseAction; otherwise the result is simply a list of strings. If you want an element to consume input but not appear in the result, remove it by calling the .suppress() method or by wrapping the element in pp.Suppress. It is not put to good use here, but an element can also be given a name with the .setResultsName method.
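To show how suppression, parse actions, and result names fit together, here is a reduced sketch that parses only the first four fields of the sentence. The names `to_int`, `comma`, and `header`, and the shortened input string, are introduced only for this illustration.

```python
import pyparsing as pp

# Parse action: replace the matched string with an int.
to_int = lambda toks: int(toks[0])

# Suppress consumes the comma but keeps it out of the result.
comma = pp.Suppress(",")

msg_num = pp.Word(pp.nums).setParseAction(to_int).setResultsName("msgNum")
num_sv = pp.Word(pp.nums).setParseAction(to_int).setResultsName("numSV")

# The second field has no parse action, so it stays a string.
header = pp.Literal("$GPGSV") + comma + pp.Word(pp.nums) + comma + msg_num + comma + num_sv

result = header.parseString("$GPGSV,3,2,12")

tokens = list(result)            # positional access
msg_number = result["msgNum"]    # access by result name
satellites = result["numSV"]
print(tokens, msg_number, satellites)
```

Named results make the calling code independent of field positions, which helps when the grammar later gains or loses suppressed elements.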

Conveniences and pitfalls

The value returned by the final parseString call is a list, but its elements are split per ParserElement, or per unit explicitly joined with Combine. So if you simply chain elements with And, the result will not line up with the expression that assembles the final parser. Use Combine where appropriate.
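The effect of Combine can be seen in a minimal sketch like this one, which parses a made-up decimal number with and without it.

```python
import pyparsing as pp

integer = pp.Word(pp.nums)

# Plain And: every element contributes its own token.
separate = integer + pp.Literal(".") + integer

# Combine: the matched pieces are joined into a single token.
joined = pp.Combine(integer + pp.Literal(".") + integer)

separate_tokens = list(separate.parseString("3.14"))
joined_tokens = list(joined.parseString("3.14"))
print(separate_tokens, joined_tokens)
```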

Related to Combine: if you do nothing, the result is a flat list of strings. Because it is not nested, you can no longer make sense of the result once ZeroOrMore is used heavily. Use Group when you want nesting. If you plan to use setParseAction, you will almost certainly need it.

As a result, you cannot write the grammar purely declaratively; you end up building method chains instead.
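A short sketch of the flat versus nested result, using a made-up key=value grammar (the names `item`, `flat`, and `nested` are only for illustration):

```python
import pyparsing as pp

key = pp.Word(pp.alphas)
value = pp.Word(pp.nums)
item = key + pp.Suppress("=") + value

# Without Group, repeated items are flattened into one list.
flat = pp.OneOrMore(item)

# With Group, each repetition becomes its own sub-list.
nested = pp.OneOrMore(pp.Group(item))

text = "a=1 b=2"
flat_tokens = flat.parseString(text).asList()
nested_tokens = nested.parseString(text).asList()
print(flat_tokens)
print(nested_tokens)
```

With the flat form, the caller has to know that items come in pairs; the grouped form carries that structure in the result itself.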

According to the documentation, a parse action may have one of several signatures (prototypes). Each signature is tried in turn: the function is called with that set of arguments, and if the call fails the next one is tried. If every attempt fails, the last error is reported as the overall error. This means that when an exception is raised inside your own function, debugging becomes extremely hard. Even if the real bug is in how you process the arguments, the error you see is the last signature's "not enough arguments" error, which tells you nothing about what actually went wrong.
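For reference, the sketch below registers the same conversion under three of the documented parse action signatures. The function names are hypothetical. How errors raised inside the function are reported may differ between the 2.0.2 release used in this article and later pyparsing versions, so this only shows the accepted call shapes.

```python
import pyparsing as pp

def with_tokens(toks):
    # toks: the matched tokens
    return int(toks[0])

def with_loc(loc, toks):
    # loc: position in the input where the match started
    return int(toks[0])

def with_all(s, loc, toks):
    # s: the original input string
    return int(toks[0])

results = []
for action in (with_tokens, with_loc, with_all):
    number = pp.Word(pp.nums).setParseAction(action)
    results.append(list(number.parseString("42")))

print(results)
```

Keeping parse actions tiny, and testing them as plain functions before attaching them, is one way to avoid chasing a misleading error from inside the parser.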

that's all.
