TRIE implementation by Python-Double array (with Tail)-

Overview

Implement a double array on Python and measure its memory usage and search time. When evaluating the performance, I searched variously, but I could not find a convincing implementation, so I decided to make it myself. Originally, we should consider the efficiency of the process of creating a double array, but this time we will not focus on that. The purpose is to evaluate the performance of the completed double array. (Therefore, dynamic addition is also excluded.) Also, the operation with files is not considered. Create it in memory and measure its performance there.

TRIE structure

It is one of the methods to manage the character string data, and the search can be realized at an effective speed. Although it has a simple structure in theory, it may not be very effective depending on the implementation method.

-Tri-tree (Wikipedia)

"Double array" is one of the implementation methods of the TRIE structure and can realize high-speed search.

Double array

For the double array, please refer to the easy-to-understand explanations in various places.

-Double array that can be understood by the Master of Information Systems -How to implement double array (slideshare) -Double array trivia (slideshare) -[How to learn string algorithm](https://developer.hatenastaff.com/entry/2016/12/22/210006#6-%E3%83%88%E3%83%A9%E3%82% A4% E6% 9C% A8-Trie-% E3% 81% AB% E3% 82% 88% E3% 82% 8B% E8% A4% 87% E6% 95% B0% E3% 83% 91% E3% 82 % BF% E3% 83% BC% E3% 83% B3% E3% 83% 9E% E3% 83% 83% E3% 83% 81% E3% 83% B3% E3% 82% B0) -I wrote the double array TRIE several times in my life (Qiita) -Realization method of tri-search by double array method (Paper by Professor Junichi Aoe)

Implementation in Python

I've implemented "double arrays" in other programming languages (C, C ++, Java, etc.) before, but this is my first implementation in Python. I think there is an efficiency improvement method peculiar to Python, but I will implement it according to the basic algorithm. I used the basic data structures (List, Array, etc.) that the language has, but created the algorithm implementation in full scratch.

If you can't see the whole picture of the link in one node, you can't check where it can be stored in the array, so first, do a simple (speed-free) TRIE implementation so that you can see the whole link, and then do that. Try to store in a double array.

In this implementation of double array, Tail is taken into consideration. It is possible to represent all strings in a TRIE structure. Doing so simplifies the algorithm. However, strings near the end have no branches, creating links with connected nodes, resulting in higher memory and time costs. Therefore, we can improve this by managing the terminating string without branches as Tail. In other words, we implemented a double array with tail from the viewpoint of implementing better performance in performance evaluation.

-Program Source (Github)

Performance evaluation

--Performance measurement method --Measure exact match search Try 100 times and get the average value

--Execution environment - Windows 10 Pro - Intel(R) Core(TM)i7-75600U CPU @ 2.40GHz - RAM 16GB - Python 3.7.1

--Search word --87,379 English words obtained from wordnet

--Execution time & memory used --Number of search words --50,000 cases --Result: Execution time --0.281 seconds --Result: Memory used - 7.5MB

Future improvements

--Addition of prefix search ――This time, we only search for records that match the character string, but when considering other applications, there will be situations where prefix search is required. --Performance improvement ――Since it is an implementation at the level of "I made it for the time being", I think there is room for efficiency improvement peculiar to Python. I would like to aim for further performance improvement.

Recommended Posts

TRIE implementation by Python-Double array (with Tail)-
Merge array with PyYAML
Save the output of GAN one by one ~ With the implementation of GAN by PyTorch ~