Counting Thai and Arabic characters correctly in Python

Unicode is difficult

Handling Unicode is difficult in many ways. I have been studying it a lot lately, but I am still a Unicode beginner, so what follows may contain terrible mistakes.

I already knew about the confusing differences between the Unicode normalization forms (NFC, NFD, NFKC, NFKD). On a separate layer, though, it seems that counting Thai, Arabic, Devanagari, and similar characters as they appear visually requires counting at a higher level, in units called graphemes.
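As a small illustration of the normalization differences mentioned above, here is a minimal sketch using only the standard library's unicodedata.normalize. The sample strings ("é" and the "ﬁ" ligature) are my own picks for illustration, not taken from any reference.

```python
import unicodedata

precomposed = "\u00e9"  # "é" stored as a single code point

nfc = unicodedata.normalize("NFC", precomposed)  # composed form
nfd = unicodedata.normalize("NFD", precomposed)  # decomposed form: "e" + combining acute accent

print(len(nfc))  # 1
print(len(nfd))  # 2

ligature = "\ufb01"  # "ﬁ" (LATIN SMALL LIGATURE FI)
nfc_ligature = unicodedata.normalize("NFC", ligature)    # canonical forms leave the ligature alone
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # compatibility forms expand it

print(nfc_ligature == ligature)  # True
print(nfkc_ligature)             # fi
```

The same visible text can therefore have different code point counts depending on the normalization form, which is a separate issue from grapheme counting.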

Reference: 7 ways to count the number of characters

Grapheme

In other words, it seems that:

- If you count characters the usual way in a programming language, you get the number of code points.
- A single character, as seen visually, may actually be composed of multiple code points.
- The visually correct unit for a single character is the grapheme cluster.
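To make the points above concrete, here is a minimal sketch that lists the code points behind one visually single Thai character. The sample "กิ" is my own choice, and the sketch uses only the standard library's unicodedata.name and unicodedata.category.

```python
import unicodedata

cluster = "\u0e01\u0e34"  # "กิ": looks like one character on screen

# Describe each code point as (code point, Unicode name, general category).
described = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
    for ch in cluster
]

print(len(cluster))  # 2 code points
for entry in described:
    print(entry)
# ('U+0E01', 'THAI CHARACTER KO KAI', 'Lo')
# ('U+0E34', 'THAI CHARACTER SARA I', 'Mn')
```

The second code point has category Mn (nonspacing mark): it is drawn on top of the preceding consonant, which is why len() reports 2 for what looks like one character.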

So what about Python?

So what tools does Python have for counting grapheme clusters? The standard library module unicodedata did not seem to include one.

Answer
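For comparison, here is a rough approximation that can be written with unicodedata alone. Note that this is not the Unicode grapheme cluster algorithm: the function name approx_grapheme_count is hypothetical, and the rule (skip every code point whose general category is a combining mark) is a simplifying assumption of mine. It will miscount emoji ZWJ sequences, Hangul jamo, flag sequences, and other cases.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Count code points that are not combining marks (categories Mn, Mc, Me).

    Hypothetical helper for illustration only; it approximates grapheme
    clusters by assuming every mark attaches to the preceding base character.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


phrase = "กินข้าวเย็น"
print(len(phrase))                    # 11 code points
print(approx_grapheme_count(phrase))  # 8

decomposed = "e\u0301"  # "e" followed by a combining acute accent
print(approx_grapheme_count(decomposed))  # 1
```

For this Thai phrase the approximation happens to give the visually expected count, but for real use a library that implements the proper segmentation rules seems like the safer choice.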

There seems to be a package called uniseg.

In this article, I mainly show examples in Python 3. (I won't go into the differences in how unicode, str, and bytes are handled between Python 2 and Python 3; that would take us far off topic.)

Installation

$ pip install uniseg

Usage example

>>> import uniseg.graphemecluster
>>> grapheme_split = lambda w: tuple(uniseg.graphemecluster.grapheme_clusters(w))
>>>
>>> phrase = 'กินข้าวเย็น'  # Apparently a Thai phrase meaning "eat dinner"
>>> len(phrase.encode('UTF-8'))  # Bytes in UTF-8
33
>>> len(phrase)  # Code points
11
>>> len(grapheme_split(phrase))  # Grapheme clusters
8

And so on.

Other

uniseg also seems to provide word-level and sentence-level segmentation. The word segmentation appears to split mainly at spaces, so it probably cannot separate words in Japanese, an agglutinative language that is written without spaces between words.
