I wrote a tri-tree that can be used for high-speed dictionary implementation in D language and Python.

What is a trie tree?

Do you know the data structure called Trie tree? Tri-tree is a data structure that can be used to implement high-speed dictionaries. Implementing using a data structure called LOUDS has the advantage that the tree can be represented with very little memory. Tri-tree is used to keep the Japanese dictionary in memory, for example, in kana-kanji conversion (Google Japanese input, etc.).

The trie tree has this shape. trie tree (The figure is taken from Wikipedia)

What makes it different from other trees is that the "branches" of the tree are labeled.

How can I use this as a dictionary? Let's search for "in". it's simple. All you have to do is follow i-> n. You can retrieve "5" as a search result. Let's search for "inn". All you have to do is move down from the current "5" position. "9" can be retrieved as a result.

Trie tree features

You can search at high speed

What's great about this trie tree is that you can search very fast. In a tri-tree, the search speed is proportional to the size of the key to search. And it's not proportional to the size of the __ tree! __

For example, suppose you implement a Japanese dictionary with a tri-tree. Searching for the 12-letter "Waraukado ni Fukukitaru" in the Japanese dictionary takes about 4 (= 12/3) times longer than searching for the 3-letter "Warau". However, if you look up the same word, whether the dictionary contains 100 words or 1 million words, the search speed will theoretically be the same. It's amazing.

Reduce wasteful searches

Also, when searching for a character string that has a common part, you can reduce unnecessary searches by restarting the search from the middle. For example, when I searched for "inn" earlier, I was able to quickly find the search results for "inn" by restarting the search from the node "5" in the result of searching for "in".

Implementation by LOUDS

Implementing a tri-tree without thinking about it using structures and pointers consumes a lot of memory. The machine freezes if the memory allocation is too busy. Therefore, we use a data representation method called LOUDS. With LOUDS, one node can be represented by only 2 bits. A tree with 11 nodes like the one shown at the beginning can be represented by only 23 bits, and it fits in one long type variable.

I would like to explain LOUDS here, but since Explanation of succinct data structure LOUDS was very easy to understand, I will give it to you. ..

code

The code is below. Implemented in D language and Python respectively. github.com/IshitaTakeshi/Louds-Trie I've written a lot of comments, so I think you can understand it by reading the LOUDS explanation at the link above and then reading the code. If you have any questions or concerns, please comment or tell us on twitter.

Boxed edition

It can be handled by the D language package management system DUB. DUB page Source code

Recommended Posts

I wrote a tri-tree that can be used for high-speed dictionary implementation in D language and Python.
I created a template for a Python project that can be used universally
I made a familiar function that can be used in statistics with Python
Understand the probabilities and statistics that can be used for progress management with a python program
A memo that I wrote a quicksort in Python
Functions that can be used in for statements
I wrote a class in Python3 and Java
I want to create a priority queue that can be updated in Python (2.7)
Easy program installer and automatic program updater that can be used in any language
Scripts that can be used when using bottle in Python
Can be used with AtCoder! A collection of techniques for drawing short code in Python!
I made a tool to automatically generate a state transition diagram that can be used for both web development and application development
A timer (ticker) that can be used in the field (can be used anywhere)
I made a shuffle that can be reset (reverted) with Python
New features in Python 3.9 (1)-Union operators can be used in dictionary types
Python standard input summary that can be used in competition pro
I made it because I want JSON data that can be used freely in demos and prototypes
What I learned and coded for a function that opens a special Windows folder in Python3 ctypes
I bought and analyzed the year-end jumbo lottery with Python that can be executed in Colaboratory
Easy padding of data that can be used in natural language processing
Mathematical optimization that can be used for free work with Python + PuLP
I wrote a graph like R glmnet in Python for sparse modeling in Lasso
++ and-cannot be used for increment / decrement in python
I made a python dictionary file for Neocomplete
[Python3] Code that can be used when you want to cut out an image in a specific size
A class for PYTHON that can be operated without being aware of LDAP
I tried to create a class that can easily serialize Json in Python
I registered PyQCheck, a library that can perform QuickCheck with Python, in PyPI.
A personal memo of Pandas related operations that can be used in practice
Note that I understand the least squares algorithm. And I wrote it in Python.
How to install a Python library that can be used by pharmaceutical companies
I made a tool in Python that right-clicks an Excel file and divides it into files for each sheet.
Summary of statistical data analysis methods using Python that can be used in business
Geographic information visualization of R and Python that can be expressed in Power BI
Set up an FTP server that can be created and destroyed immediately (in Python)
A mechanism to call a Ruby method from Python that can be done in 200 lines
Basic algorithms that can be used in competition pros
Japanese can be used with Python in Docker environment
I made a VM that runs OpenCV for Python
Python knowledge notes that can be used with AtCoder
ANTs image registration that can be used in 5 minutes
Can be used in competition pros! Python standard library
I wrote python3.4 in .envrc with direnv and allowed it, but I got a syntax error
Install Mecab and CaboCha on ubuntu16.04LTS so that it can be used from python3 series
How to set up a simple SMTP server that can be tested locally in Python
I wrote a miscellaneous Ansible module that enables Virtualenv to be used by installing Pythonz.
[Django] Field names, user registration, and login methods that can be used in the User model
[Python3] Code that can be used when you want to resize images in folder units
[Atcoder] [C ++] I made a test automation tool that can be used during the contest
I made a data extension class for tensorflow> = 2.0 because ImageDataGenerator can no longer be used.
[Python] A program to find the number of apples and oranges that can be harvested
Goroutine (parallel control) that can be used in the field
I tried "a program that removes duplicate statements in Python"
Goroutine that can be used in the field (errgroup.Group edition)
Create code that outputs "A and pretending B" in python
I created a class in Python and tried duck typing
I wrote a script that splits the image in two
[Python] Draw elevation data on a sphere with Plotly and draw a globe that can be rotated round and round
I wrote python in Japanese
Create a dictionary in Python
[For beginners] Baseball statistics and PyData that can be remembered in 33 minutes and 4 seconds ~ With Dai-Kang Yang