[PYTHON] Introduction to regular expression processing system

Introduction

This is my second post, @reonyanarticle.

I was late, but I managed to finish writing it, so I will post it.

This time I will write about regular expressions. Regular expressions and natural language processing are inseparably related. There is no time to list when talking about when to use awk or grep to delete unnecessary words that exist in the sentence or to replace them. I wanted to write about such familiar regular expressions, but there are many articles about how to write regular expressions, so it's not interesting to write them. So this time, I'd like to talk a little about how regular expressions check for matching.

As a reminder, I don't talk about regular expression operators, and I use math words a lot. If you are interested, please check it out. In addition, undefined words will appear frequently, but I will gradually improve them.

Regular expression definition

In the book I referred to, "Regular expressions are a kind of" expression ". ] Is written. A regular expression is an "expression" that represents a particular set of strings.

For example, if you use only the letters a and b and only one a appears, there are a, abbbbbb, bbbbab, ab, etc., but "only one a appears" There are an infinite number of "characters". I can't write everything. However, if you use a regular expression, you can write it with one expression, b * ab *. (I recognize that it is a comprehension in a set.)

Difference in processing system

Now that we've briefly touched on the definition of regular expressions, let's look at how regular expression matching is verified. There are two types of processing systems, DFA type and VM type.

--DFA type: Converts a regular expression to a deterministic finite automaton (DFA) for matching. There is a processing system of GNU grep. --VM type: Converts a regular expression to bytecode for matching. There are processing systems for most scripting languages ​​such as Python and JavaScript.

The actual mechanism is as follows, referring to the figure in the bibliography. DFAエンジン.png

(This time, I will not explain the mechanism of VM type matching. Please read the reference and check it!)

Matching is confirmed by constructing a DFA corresponding to a regular expression and checking whether the character string given to the DFA passes or does not pass. This time, I'm going to use Python to create a DFA-type (simple) processing system! To that end, we will start with the definition of DFA that came out earlier.

DFA definitions and examples

Definition of DFA

Deterministic finite automaton quintuplet $ {\ mathcal A} = \ langle Q, \ Sigma, \ delta, s, F \ rangle $. here,

-$ Q $ is a finite set called a state -$ \ Sigma $ is a finite set of characters -$ \ delta: Q \ times \ Sigma \ rightarrow Q $ is a function called a transition function that receives a state and a character and returns the state. -$ s \ in Q $ is called the initial state -$ F \ subset Q $ is called the accepted state.

DFA example

I don't think you can tell what you're saying from the above definition alone, so let's touch on an example. Since it is difficult to handle many characters from the beginning, let's use DFA to write a regular expression called b * ab * that only one a appears, using only a and b as usable characters. ..

Let $ \ Sigma = \ {a, b \} $ and $ Q = \ {0, 1, 2 \} $. Also, set $ s = 0 $ and $ F = \ {1 \} $, and define the transition function as follows.

Now that we have defined a quintuplet, the DFA, $ {\ mathcal A} $, where only one a appears, is as follows.

automata1.dot.png

Did you somehow understand DFA by touching on the definition and example of DFA? If you don't understand, I will write a reference, so you may find out more by reading it.

Matching using DFA

Now that we have defined the DFA, let's define that the DFA accepts strings and see if it actually matches.

Definition of DFA accepting strings

Let DFA be $ {\ mathcal A} = \ langle Q, \ Sigma, \ delta, I, F \ rangle $ and the string be $ w = w_0 \ cdots w_n $$ (w_i \ in \ Sigma) $. At this time, when the state sequence $ \ {r_0, \ ldots, r_n \} \ subset Q $ that satisfies the following three conditions exists, $ {\ mathcal A} $ is said to accept $ w $.

  1. r_0 = s
  2. When $ i <n $, $ \ delta (r_i, w_i) = r_ {i + 1} $
  3. r_n \in F

Example of DFA accepting a string

I don't think it's easy to understand the definition of DFA accepting strings, so let's look at a concrete example. As before, I will touch on the regular expression b * ab *, in which only one a appears. For this regular expression, consider the string ab and the string bbbaba.

For the string ab, consider the state string $ \ {0,1 \} \ subset Q $. Then, $ \ delta (0, a) = 1 $, $ \ delta (1, b) = 1 $, and $ 0 = s $, which satisfies the condition of accepting $ 1 \ in F $. Therefore $ {\ mathcal A} $ accepts ab. This is a DFA match.

Next, regarding bbbaba, no matter what state column is fetched, it will not be accepted by $ {\ mathcal A} $. Consider $ \ {0, 0, 0, 1, 1, 2 \} \ subset Q $, which simply transitions bbbaba. $ \ Delta (0, b) = 0 $, $ \ delta (0, b) = 0 $, $ \ delta (0, b) = 0 $, $ \ delta (0, a) = 1 $, $ \ delta (1, b) = 1 $, $ \ delta (1, a) = 2 $, and $ 2 \ notin F $, so it is not satisfied. So bbbaba will not be accepted. However, the substrings bbbab and bbba are accepted. The subcolumns of the above state column, $ \ {0, 0, 0, 1, 1 \} $, $ \ {0, 0, 0, 1 \} $, each satisfy the conditions. Because. You can use this to define the shortest match as "the smallest string that DFA accepts as a substring of a string".

Now, though concrete, I was able to generate a DFA from the regular expression and use the DFA to see if the string matches the regular expression!

Implementation using Python

Now let's actually implement it using Python. DFA cannot be created without giving a concrete regular expression, so let's use b * ab * again. Actually, it would be nice if DFA could be implemented from scratch, but due to various reasons of the author, I use a module called automata. If you do your best, you can make your own, so please do your best! (I think it's a pain to make a DFA module.)

from automata.fa.dfa import DFA

dfa = DFA(
    states={'0', '1', '2'},
    input_symbols={'a', 'b'},
    transitions={
        '0': {'a': '1', 'b': '0'},
        '1': {'a': '2', 'b': '1'},
        '2': {'a': '2', 'b': '2'}
    },
    initial_state='0',
    final_states={'1'}
)


def is_accepted(words: str):
    if dfa.accepts_input(words):
        print('It will be accepted.')
    else:
        print('Not accepted.')


def check_min_matching(words: str):
    i = 0
    while i < len(words):
        if dfa.accepts_input(words[:(i+1)]):
            print(f'{words[:(i+1)]}Has been accepted.')
            break
        else:
            i = i + 1
    else:
        print(f'{words}Was not accepted.')

if __name__ == "__main__":
    is_accepted('ab')
    is_accepted('bbaba')
    check_min_matching('ab')
    check_min_matching('bbbaba')
    check_min_matching('bbb')

When I run it, the following output result is returned.

It will be accepted.
Not accepted.
a has been accepted.
bbba has been accepted.
bbb was not accepted.

You have successfully expressed b * ab * in DFA! I'll do my best to show you what I'm actually doing, so please be patient.

Finally

This time, only for b * ab *, I was able to create a processing system. I think that you can know more about regular expressions if you know through DFA how you actually check the matching of regular expressions.

By the way, can all the regular expressions be written in DFA for some readers? I think some people think that. Actually, such a theorem exists.

Kleene's theorem

For any language $ L $ on the finite character set $ \ Sigma $, the following three propositions are equivalent.

  1. Language $ L $ can be represented by a regular expression
  2. Language $ L $ can be recognized by automaton

In other words, anything that can be written in a regular expression can be accepted by an automaton, and a language that can be accepted by an automaton can be written in a regular expression. Because of this theorem, we can see that DFA can be created for operations other than *. It's amazing! (Insufficient ability to convey great things ...) Please try to make your own DFA for other operations and regular expressions with more characters! It is fun!

There are lots of interesting stories about theorems, DFA and regular expressions that I couldn't cover in this article! Please take a look at the references.

~~ Tomorrow ~~ Before I published the article on the 24th, @jyami wrote Story of making shell-like with go! Thank you for your hard work!

References

The books I read when learning formal languages ​​in the olden days are:

-Basics of Theory of Computation 1. Automata and Language

The books I used to study regular expressions this time are as follows. I referred to the figures and explanations on the way. It was a lot of fun to read, so please read it.

-Introduction to Regular Expression Technology-Latest Engine Implementation and Theoretical Background

As an aside, one of the authors of this book, Ryoma Shinya, has posted a lot of slides related to regular expressions and automata on the net, so it's interesting to see them too!

Recommended Posts

Introduction to regular expression processing system
opencv-python Introduction to image processing
[Introduction to Python3 Day 21] Chapter 10 System (10.1 to 10.5)
[Introduction to RasPi4] Environment construction; natural language processing system mecab, etc. .. .. ♪
Introduction to MQTT (Introduction)
Introduction to Scrapy (1)
Introduction to Scrapy (3)
Introduction to Supervisor
Introduction to Tkinter 1: Introduction
Regular expression Greedy
Introduction to PyQt
Introduction to Scrapy (2)
[Linux] Introduction to Linux
Introduction to Scrapy (4)
Introduction to discord.py (2)
real-time-Personal-estimation (system introduction)
Regular expression re
Introduction to discord.py
How to write regular expression patterns in Linux
[Introduction to TensorBoard] Visualize TensorFlow processing to deepen understanding
Regular expression in regex.h
0 Convert unfilled date to datetime type with regular expression
Introduction to Lightning pytorch
Introduction to Web Scraping
Introduction to Nonparametric Bayes
100 Language Processing Knock-80 (Replace with Regular Expression): Corpus Formatting
Introduction to EV3 / MicroPython
[Chapter 5] Introduction to Python with 100 knocks of language processing
An introduction to Python distributed parallel processing with Ray
Introduction to Python language
[Chapter 6] Introduction to scikit-learn with 100 knocks of language processing
Regular expression look-ahead, after-yomi
python regular expression memo
Introduction to OpenCV (python)-(2)
Regular expression matching method
[Chapter 3] Introduction to Python with 100 knocks of language processing
Introduction to PyQt4 Part 1
Introduction to Dependency Injection
Regular expression in Python
Introduction to Private Chainer
Regular expression in Python
Regular expression confirmation quiz!
Introduction to machine learning
[Introduction to TensorBoard: image] TensorFlow Visualize image processing to deepen understanding
[Chapter 4] Introduction to Python with 100 knocks of language processing
Python regular expression basics and tips to learn from scratch
[Explanation for beginners] Introduction to convolution processing (explained in TensorFlow)
[Explanation for beginners] Introduction to pooling processing (explained in TensorFlow)
AOJ Introduction to Programming Topic # 1, Topic # 2, Topic # 3, Topic # 4
Introduction to electronic paper modules
A quick introduction to pytest-mock
Introduction to dictionary lookup algorithm
Introduction to Monte Carlo Method
[Learning memorandum] Introduction to vim
Introduction to PyTorch (1) Automatic differentiation
Introduction to Python Django (2) Win
Introduction to Cython Writing [Notes]
An introduction to private TensorFlow
Kubernetes Scheduler Introduction to Homebrew
An introduction to machine learning
[Introduction to cx_Oracle] Overview of cx_Oracle