Interactively output BPE using python curses

This post uses curses, a library for building text user interfaces (TUIs), to display the learning progress of BPE in a readable way.

The full code is available as a gist: bpe_curses.py

environment

BPE

(If you are only interested in curses, skip this section.)

What is BPE

Byte Pair Encoding (BPE) is a subword segmentation technique that is also used in SentencePiece, a tokenizer for neural machine translation. It was introduced to NLP in "Neural Machine Translation of Rare Words with Subword Units" (ACL 2016), and the paper also includes an implementation.

For example, given words such as lower, newer, and wider, you can reduce the vocabulary size by treating the frequent pair `e r` as a single symbol `er`.

As this suggests, although BPE is best known as a subword segmentation algorithm in NLP, it was originally a data compression method. The principle is also described in [Byte pair encoding (Wikipedia)](https://ja.wikipedia.org/wiki/Byte pair encoding).

Here, the progress of the compression is displayed using the code from the paper.

BPE implementation overview

For the BPE implementation, I used the code from the paper as is and only added type hints.

The implementation consists of two main functions, get_stats() and merge_vocab().

- get_stats takes a vocabulary dictionary and counts the frequency of each pair of adjacent symbols. A defaultdict is used so that pairs not yet present as keys can be counted without special handling.

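import collections
from typing import DefaultDict, Dict, Tuple

def get_stats(vocab: Dict[str, int]) -> DefaultDict[Tuple[str, str], int]:
    # Count each adjacent symbol pair, weighted by the frequency of its word
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs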

- merge_vocab takes the most frequent pair found by get_stats and merges it so that it is treated as a single symbol.

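import re
from typing import Dict, Tuple

def merge_vocab(pair: Tuple[str, str], v_in: Dict[str, int]) -> Dict[str, int]:
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only where both symbols are whole, whitespace-delimited tokens
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        # Replace every occurrence of the pair with the merged symbol
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out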

Here, we display how the words change after each call to merge_vocab.
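As a minimal sketch of how the two functions fit together, the loop below repeatedly picks the most frequent pair and merges it. The toy vocabulary (the three words from the earlier example, each with an assumed frequency of 1) and the helper name `learn_bpe` are illustrative assumptions and are not taken from bpe_curses.py.

```python
import collections
import re
from typing import Dict, List, Tuple


def get_stats(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    # Count each adjacent symbol pair, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair: Tuple[str, str], v_in: Dict[str, int]) -> Dict[str, int]:
    # Replace the whitespace-delimited pair with its concatenation
    p = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in v_in.items()}


def learn_bpe(vocab: Dict[str, int], num_merges: int):
    # Hypothetical helper: returns the merged pairs and the vocabulary after each step
    merges: List[Tuple[str, str]] = []
    history = [dict(vocab)]
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
        history.append(vocab)
    return merges, history


# Assumed toy data: each word occurs once
toy_vocab = {'l o w e r': 1, 'n e w e r': 1, 'w i d e r': 1}
merges, history = learn_bpe(toy_vocab, num_merges=1)
print(merges[0])          # the pair chosen in the first step
print(list(history[-1]))  # the words after that merge
```

With this toy data the pair (e, r) occurs three times, more than any other pair, so it is merged first.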

Curses

What is Curses?

"The curses library supplies a terminal-independent screen-painting and keyboard-handling facility for text-based terminals; such terminals include VT100s, the Linux console, and the simulated terminal provided by various programs." (From Curses Programming with Python)

curses is part of the Python standard library (although it does not seem to be included in the Windows version). With curses, you can easily build something like a CUI application.

For example, life.py in the Python demos is an implementation of the [Game of Life](https://ja.wikipedia.org/wiki/Lifegame).

lifepy.gif

How to use Curses

We will implement the display of state transitions with curses.

Let's use wrapper

As described in Curses Programming with Python (https://docs.python.org/en/3/howto/curses.html), use the curses.wrapper() function to avoid the complexity of initialization and error handling.

import curses

def main(stdscr):
    # Do the curses processing here using stdscr
    pass

if __name__ == '__main__':
    curses.wrapper(main)
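To see what the wrapper saves you from, here is a simplified sketch of roughly the same setup and cleanup written by hand. The function name `run_with_cleanup` is hypothetical, and the real curses.wrapper() does a little more (for example, it also tries to initialize colors), so treat this as an illustration only.

```python
import curses


def run_with_cleanup(func):
    # Hypothetical, simplified stand-in for curses.wrapper()
    stdscr = curses.initscr()       # switch the terminal into curses mode
    try:
        curses.noecho()             # do not echo typed keys
        curses.cbreak()             # deliver keys without waiting for Enter
        stdscr.keypad(True)         # translate special keys such as arrows
        return func(stdscr)
    finally:
        # Restore the terminal even if func raised an exception
        stdscr.keypad(False)
        curses.echo()
        curses.nocbreak()
        curses.endwin()
```

Without the cleanup in the finally block, an exception inside func would leave the terminal in a broken state, which is the main reason to prefer curses.wrapper().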

Basic operation

The basic processing flow is as follows.

- stdscr.addstr(str): Add the text str at the current position
- stdscr.refresh(): Refresh the display
  - Displays the text that was added with addstr
- stdscr.getkey(): Accept a keystroke
  - Acts as a wait (otherwise the program would end)

for i in range(10):
    stdscr.addstr('{}\n'.format(i))
    stdscr.refresh()
    stdscr.getkey()

The code above repeatedly displays a number and then waits for a key press.

Preventing off-screen errors

Attempting to draw beyond the height of the screen raises an error.

To prevent this, first get the current screen size and then make sure nothing is drawn outside that range. You can get the size with getmaxyx().

stdscr_y, stdscr_x = stdscr.getmaxyx()
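One way to stay inside the screen is to cut the text down to the window size before drawing. The sketch below assumes two hypothetical helpers, `fit_to_screen` and `draw_lines`, which are not part of bpe_curses.py.

```python
import curses
from typing import List


def fit_to_screen(lines: List[str], max_y: int, max_x: int) -> List[str]:
    """Keep only the rows and columns that fit in a max_y x max_x window."""
    # Leave the last column unused: writing into the bottom-right cell
    # makes addstr raise curses.error.
    width = max(max_x - 1, 0)
    return [line[:width] for line in lines[:max_y]]


def draw_lines(stdscr, lines: List[str]) -> None:
    """Draw as many lines as fit, one per row, starting at the top-left."""
    max_y, max_x = stdscr.getmaxyx()
    stdscr.clear()
    for y, line in enumerate(fit_to_screen(lines, max_y, max_x)):
        stdscr.addstr(y, 0, line)
    stdscr.refresh()
```

Keeping the truncation in a plain function makes it easy to check without a terminal; draw_lines is meant to be called from a function passed to curses.wrapper().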

Improving the display

Left as is, the output is rather plain, so let's improve the display a little.

Here, words in which symbols are merged are shown in bold. Specifically, pass the attribute curses.A_BOLD to `addstr`.

You can also add color or make the text blink. The attributes described in Attributes and Colors can be used as follows.

stdscr.addstr('This is A_BOLD\n', curses.A_BOLD)
stdscr.addstr('This is A_BLINK\n', curses.A_BLINK)
stdscr.addstr('This is A_DIM\n', curses.A_DIM)
stdscr.addstr('This is A_STANDOUT\n', curses.A_STANDOUT)
stdscr.addstr('This is A_REVERSE\n', curses.A_REVERSE)
stdscr.addstr('This is A_UNDERLINE\n\n', curses.A_UNDERLINE)
# Define color pair 1: red text on a white background
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
stdscr.addstr("This is curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)\n", curses.color_pair(1))
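Applied to BPE, the attribute can be chosen per word. The following is a rough sketch under the assumption that a word is shown in bold when it contains the pair that is about to be merged; `contains_pair` and `draw_vocab` are hypothetical names, not the functions used in bpe_curses.py.

```python
import curses
from typing import Dict, Tuple


def contains_pair(word: str, pair: Tuple[str, str]) -> bool:
    """Return True if the two symbols of pair are adjacent in the word."""
    symbols = word.split()
    return any(
        (symbols[i], symbols[i + 1]) == tuple(pair)
        for i in range(len(symbols) - 1)
    )


def draw_vocab(stdscr, vocab: Dict[str, int], pair: Tuple[str, str]) -> None:
    """Draw one word per row, in bold if it contains the pair."""
    max_y, max_x = stdscr.getmaxyx()
    stdscr.clear()
    for y, word in enumerate(list(vocab)[:max_y]):
        attr = curses.A_BOLD if contains_pair(word, pair) else curses.A_NORMAL
        stdscr.addstr(y, 0, word[:max(max_x - 1, 0)], attr)
    stdscr.refresh()
```

For instance, contains_pair('n e w e r', ('e', 'r')) is true, while a word where the pair has already been merged into er no longer matches.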

Output result

When you run bpe_curses.py, which was written with the above in mind, a merge is performed each time you press a key and the updated result is displayed, as shown below.

bpe_sample.gif
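The key-driven loop can be sketched roughly as follows. This is not the code of bpe_curses.py: the function names `count_symbols` and `interactive_bpe`, the single-line layout, and the reading of the number on the left as the count of distinct symbols are all assumptions made for illustration.

```python
import collections
import curses
import re


def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    p = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in v_in.items()}


def count_symbols(vocab):
    """Number of distinct symbols used by the words (assumed metric)."""
    return len({symbol for word in vocab for symbol in word.split()})


def interactive_bpe(stdscr, vocab, num_merges=10):
    """Perform one merge per key press and return the final vocabulary."""
    for step in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        max_y, max_x = stdscr.getmaxyx()
        line = '{} {}: {}'.format(count_symbols(vocab), step + 1, ' | '.join(vocab))
        stdscr.clear()
        stdscr.addstr(0, 0, line[:max(max_x - 1, 0)])  # stay inside the screen
        stdscr.refresh()
        stdscr.getkey()  # wait for a key press before the next merge
    return vocab
```

To try it in a terminal, something like curses.wrapper(interactive_bpe, {'l o w e r': 1, 'n e w e r': 1, 'w i d e r': 1}) should work, since curses.wrapper() forwards extra arguments to the function.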

It is small and a little hard to see, but the words shown in **bold** are the ones that contain the merged pair. At first, newer and wider are in bold because `e` and `r` form the most frequent pair (2nd line).

Also, after repeating the merge 10 times, you can see that the vocabulary size shown on the far left has decreased (14 → 6).

There was not much need for interactivity this time, but since curses can receive keystrokes, I feel it could be used in many ways with a little ingenuity. It is easier than implementing a GUI, so it may be handy when you want to show a bit of output.

Reference

- Document
