[PYTHON] BEST 20 ranking of the kanji that show up in garbled text (UTF8 → SJIS)

Motivation

(Summary) I simply wanted to know. There is no deeper reason.

More details

There are many kinds of garbled text (mojibake): the kind you get when a UTF8 file is displayed as SJIS, the kind you get when UTF8 is displayed as EUC, the kind you get when EUC is displayed as UTF8, and so on. You can check what each looks like on this page (https://tools.m-bsys.com/ex/html-mojibake.php).

To be honest, I work in UTF8 these days, so I rarely see garbled text. But at the company I joined as a new graduate, text files were created in SJIS as a rule, and it happened quite often that a UTF8 file was opened as SJIS and came out garbled. When UTF8 is garbled into SJIS, you get text such as 縺 蜈 踺 踺 蜻 渺 蜻 薙 〒 縺 縺 溘 € □ 縺 ……, and you can see the same kanji appearing over and over again.

The kanji that appear in garbled text are biased. I just wanted to know which kanji appear often and what they mean ... Surprisingly, I couldn't find an article like that, so I had no choice but to write it myself ... My **particularly useless little curiosity** was the driving force behind this article.

Why the same kanji keep appearing

I will skip the reason why the same characters keep appearing, because a wonderful article about it was written last year: "Why kanji with the thread radical often appear when UTF-8 is garbled into SJIS" (https://qiita.com/kaityo256/items/878cbe35d4c8444b045a).
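For the curious, the gist can be poked at from Python. This is only a minimal sketch and is not taken from the referenced article: it assumes Python's `cp932` codec as a stand-in for SJIS, and the helper name `sjis_view_of_prefix` is made up for illustration. It decodes just the first two UTF-8 bytes of a character as SJIS, which ignores how the remaining bytes pair up.

```python
def sjis_view_of_prefix(ch: str) -> str:
    """Decode the first two UTF-8 bytes of `ch` as cp932 (assumed stand-in for SJIS)."""
    return ch.encode("utf-8")[:2].decode("cp932")


# Hiragana and Japanese punctuation take three bytes each in UTF-8,
# and their first two bytes fall into only a handful of values.
for ch in "あかみむん、。":
    utf8_hex = ch.encode("utf-8").hex()
    print(ch, utf8_hex, sjis_view_of_prefix(ch))
```

Running it shows that あ through み share the leading bytes E3 81, む and ん share E3 82, and 、 and 。 share E3 80, so each group collapses into a single kanji when read as SJIS.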

Aggregation method

Take a suitably long text, generate garbled text by **displaying the UTF8 original as SJIS**, save the result to a file, and count the kanji with Python.
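The garbling step can also be reproduced in Python instead of an editor. The following is a minimal sketch under assumptions: `cp932` is used as the SJIS codec, undecodable bytes are replaced rather than raising an error, and the function name `garble_utf8_as_sjis` is hypothetical. It is not necessarily how the data for this article was produced.

```python
def garble_utf8_as_sjis(text: str) -> str:
    """Encode `text` as UTF-8, then misread those bytes as cp932 (assumed SJIS codec)."""
    utf8_bytes = text.encode("utf-8")
    # errors="replace" turns byte sequences that are invalid in cp932 into U+FFFD
    return utf8_bytes.decode("cp932", errors="replace")


garbled = garble_utf8_as_sjis("こんにちは")
print(garbled)
```

Writing the returned string to a UTF-8 file such as `./source.txt` would give the counting script something to read.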

No matter how brilliant a famous passage is, displaying it as SJIS and then saving it as UTF8 instantly turns it into a horrible sight. I cannot help feeling the impermanence of this world.

As for which long texts to choose, I'll go with ones you probably know: Natsume Soseki's "Kokoro" from the high school textbook and Osamu Dazai's "Run, Melos!" from the junior high school textbook, counted separately. "Kokoro" is the main event, and the ranking for the short story "Run, Melos!" is introduced as a bonus.

Both texts are on Aozora Bunko, so I copied and pasted them to create the data.

Natsume Soseki "Kokoro" Osamu Dazai "Run, Melos!"

For the program, I wrote some Python code that isn't particularly interesting.

import re

# Matches a single kanji (the range from 一 to 鿐)
KANJI_PATTERN = re.compile('[一-鿐]')

with open('./source.txt', encoding="utf-8") as f:
    s: str = f.read()

# Count the characters that appear and keep the result in a dictionary.
# Characters other than kanji are excluded, so kana and symbols are rejected.
count_dic = {}
for char in s:
    if KANJI_PATTERN.search(char) is None:
        # Not a kanji, so skip it
        continue
    count_dic[char] = count_dic.get(char, 0) + 1

# Output in ascending order of count (the most frequent kanji come last)
for k, v in sorted(count_dic.items(), key=lambda x: x[1]):
    print(f"{k}: {v}")
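As an aside, the same tally can be written with `collections.Counter`, which also gives a top-N list directly. This is a sketch of an alternative, not the code used for the rankings below; the function name `rank_kanji` is hypothetical, and the kanji range is assumed to be the CJK Unified Ideographs block U+4E00 to U+9FFF, which is slightly wider than the range in the script above.

```python
import re
from collections import Counter

# Assumption: "kanji" means the CJK Unified Ideographs block (U+4E00 to U+9FFF)
KANJI_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def rank_kanji(text: str, top_n: int = 20) -> list[tuple[str, int]]:
    """Return the `top_n` most frequent kanji in `text` as (kanji, count) pairs."""
    counts = Counter(KANJI_PATTERN.findall(text))
    # most_common sorts by count in descending order
    return counts.most_common(top_n)


for rank, (kanji, count) in enumerate(rank_kanji("縺薙ｓ縺ｫ縺｡縺ｯ"), start=1):
    print(f"{rank}: {kanji} ({count} times)")
```

The sample string is the garbled form of a short greeting; for the real ranking you would pass in the contents of the garbled file instead.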

Results: the "Kokoro" division

20th place

Appearances: 1049 times $\Huge{Long}$ Kanji Kentei Level 1

Right from the leadoff batter, out comes a difficult kanji that nearly breaks my spirit. Is it aware that this is the "Kokoro" ranking? It carries the meaning "to meet", and it is the kanji used in "meguriai" (a chance meeting). Apparently both "meguri-au" and the word for "encounter" can be written with it. If you wrote with such a character in modern times, you would probably be disliked.

19th place

Appearances: 1112 times $\Huge{代}$ Kanji Kentei Level 8

The 代 in "Masashi Tashiro" (田代まさし). No stimulants.

18th place

Appearances: 1190 times $\Huge{荳}$ Kanji Kentei Level 1

It is a character that represents the beans of a plant. It certainly looks like the kanji for "bean" made more difficult. Around the 6th century there apparently was a Princess Sasage, a daughter of Emperor Keitai.

It is unknown whether this is related, but the bean called sasage (cowpea) is an annual plant of the genus Vigna that has long been eaten in Japan. Today azuki beans are used for most festive red rice, but in the past cowpea was especially preferred. In the Edo period, azuki beans were disliked by samurai because their skins split easily when boiled, and "beans with a split belly" evoked seppuku. That is said to be why the thick-skinned cowpea was used for red rice. Even now, some red rice fundamentalists claim that cowpea red rice is the true red rice.

...... This is a programming article. It's fine. I did write Python code above, so it shouldn't get deleted.

17th place

Appearances: 1201 times $\Huge{莠}$ Kanji Kentei Level 1

Its kun'yomi reading is "hagusa". It refers to weeds that resemble rice but grow only leaves and bear no grain. Green foxtail, well known as the grass used to tease cats, is one example. It looks like rice but bears no grain, so by extension it is used as a metaphor for bad things. Hence "yugen" refers to harmful, ugly words. Well, not that I ever use this expression ...

16th place

Appearances: 1401 times $\Huge{昴}$ Kanji Kentei Level 1

It's easy to confuse, but this is not the look-alike kanji used in the word for "excitement". 昴 is read "Subaru". It is a star. Sei Shonagon in the Heian period praised Subaru, writing, "As for stars: Subaru. Hikoboshi. Yufuzutsu. Yobahiboshi, somewhat charming." Subaru in the wind, the galaxy in the sand, where did everyone go, without being seen off ...

15th place

Appearances: 1493 times $\Huge{峨}$ Kanji Kentei Level 1

If you took Japanese history on the humanities track, you will have seen the name Emperor Gosaga. Even if not, you may have seen this kanji in personal names. The character 峨 describes a mountain that is high and rugged.

14th place

Appearances: 1512 times $\Huge{翫}$ Kanji Kentei Level 1

It has several readings, all meaning roughly "to play with". Shikan-jima is a kimono pattern that was popular in the Edo period. The pattern combines four vertical stripes with the shape of metal rings, so writing it as "four rings" would be semantically correct, but the kanji is said to come from "Shikan", the haiku pen name of the kabuki actor Nakamura Utaemon III.

13th place

Appearances: 1553 times $\Huge{医}$ Kanji Kentei Level 8

Is there a doctor among our customers???

12th place

Appearances: 1555 times $\Huge{上}$ Kanji Kentei Level 10

Are there any companies that are fine with a receipt made out to "Ue-sama" (上様)?

11th place

Appearances: 1625 times $\Huge{輔}$ Kanji Kentei Level 1

Although it often appears in personal names, it is, surprisingly, treated as a Kanji Kentei Level 1 character. The word "hohitsu", written with this kanji, appears in the Level 1 reading questions. Hohitsu means advising the emperor on what he should or should not do in his acts as emperor.

10th place

Appearances: 1794 times $\Huge{阪}$ Kanji Kentei Level 2

There are two kanji read "saka": 阪 and 坂. There seem to be various theories as to why, but according to "Setsuyo Ochihoshu", published in 1808, 坂 splits into 土 (soil) and 反 (return), which can be read as "returning to the soil", and some people are said to have avoided 坂 as inauspicious. Writing Osaka with 坂 is "akan" (no good). Not that I'd know.

9th place

Appearances: 2215 times $\Huge{吶}$ Kanji Kentei Level 1

It is a kanji that sometimes appears in novels, doubled as 吶々, in sentences like "he began to speak to me haltingly". Speaking this way means speaking in a faltering, mumbling manner. Being unadorned and reticent is called "bokutotsu". There is 訥 with the speech radical and 吶 with the mouth radical, but they seem to have the same meaning.

8th place

Appearances: 2482 times $\Huge{薙}$ Kanji Kentei Level 1

It is treated as Kanji Kentei Level 1, but no explanation is necessary. The three sacred treasures, "Yata no Kagami", "Ame-no-Murakumo-no-Tsurugi" (also known as "Kusanagi no Tsurugi"), and "Yasakani no Magatama", are required knowledge for otaku.

7th place

Appearances: 3147 times $\Huge{後}$ Kanji Kentei Level 9

From here on, the number of appearances jumps.

6th place

Appearances: 4078 times $\Huge{溘}$ Kanji Kentei Level 1

The character 溘 means "abruptly". Used of dying, it means, roughly speaking, "sudden death!!!".

5th place

Appearances: 4718 times $\Huge{励}$ Kanji Kentei Level 3

Likes on Qiita are encouraging. Please press the button.

4th place

Appearances: 5831 times $\Huge{九}$ Kanji Kentei Level 10

It's nine, yet it's in 4th place. From here on come the Three Musketeers of the thread radical, the kanji you really do see all the time in garbled text.

3rd place

Appearances: 6656 times $\Huge{縲}$ Kanji Kentei Level 1

Rasengan!!! ... it is not. The radical is thread, not insect. 縲 is a fairly niche kanji that represents the rope used to bind criminals. It seems to be used mostly as a pair with 絏, read "ruisetsu", as in "Rather than suffer the humiliation of ruisetsu, I would sooner take my own life right now." 絏 seems to mean to tie up.

Before the Showa era there were no metal handcuffs, so criminals were tied up with rope. In the Edo period, hojojutsu (also called torinawajutsu) was widely practiced as part of the arresting arts (martial arts for capturing an enemy bare-handed without killing). Different ways of tying seem to have been prepared for different purposes: "haya-nawa" for quickly restraining a captured enemy, "hon-nawa" for formal and ceremonial use, and torture ropes for torturing by binding. Hojojutsu was a well-established martial art, with 150 schools in the Edo period ... Amazing.

2nd place

Appearances: 12928 times $\Huge{繧}$ Kanji Kentei Level 1

The number of appearances pulls far ahead of 3rd place, nearly doubling it. This is the hateful kanji that you will have seen over and over in garbled text even if you are not an engineer ...

There is a word "ungen" (繧繝). Put simply, it is an old style of gradation. Introduced from western China, it was used in the Nara and Heian periods for Buddhist paintings, temple decoration, and dyeing and weaving. There is also a color term, "ungen-saishiki" (ungen coloring). It is easiest to understand by looking at an actual picture. It apparently comes up in color certification tests, so some web designers may know it. Reference: What is the meaning of the color?

Among the treasures of the Shosoin is the Urushikinpakue no Ban, which has an easy-to-understand color scheme. (Source: Imperial Household Agency website http://shosoin.kunaicho.go.jp/ja-JP/Treasure?id=0000014245) If you look closely, you can see that colors of the same family are stacked in layers rather than blurred into one another. This is ungen coloring. As a familiar example, the Vue logo could arguably be called ungen coloring too.

1st place

Appearances: 60693 times $\Huge{縺}$ Kanji Kentei Level 1

With an overwhelming 60,000 appearances, it posted an unrivaled score. The king of the UTF8 → SJIS garbled world is 縺, the strongest of the Three Musketeers of the thread radical!

This is the kanji for "motsure" (tangle), which you often hear in phrases such as "a tangle of passion". To "motsureru" is to become entangled, as when you speak of untangling tangled thread. Confusingly, there are two similar words, "hotsure" (fraying) and "motsure" (tangling). As their kanji, meaning "unravel" and "entangle", suggest, hotsure is coming undone and motsure is becoming entwined, so the meanings are exact opposites. Be careful.

A tangled tongue means your tongue trips over itself and you cannot speak properly. Specifically, shut-in engineers like me often get tongue-tied when talking to people they are meeting for the first time. Tangled hair is hair that is in tangles. Messy hair. Specifically, the hairstyle of a typical engineer. A kanji befitting engineers has won first place!!! (Noisy)

Results: the "Run, Melos!" division

If I stopped at "Kokoro" alone, you might think: the results above only hold for "Kokoro", right??? Wouldn't another text give completely different results??? So, just in case, I ranked another text as well. The order changes a little, but with some exceptions the results are broadly similar. Since Melos is a short story, the amount of text is small.

41st place: 20 times (out of the ranking)
32nd place: 24 times (out of the ranking)
22nd place 莠: 48 times (out of the ranking)
―――――――――――――――
20th place: 54 times new! (the "tan" in "adventure tale")
19th place: 56 times
18th place: 57 times new! (the kanji in "reconnaissance")
17th place: 59 times
16th place: 63 times
15th place: 76 times
14th place: 80 times
13th place: 92 times
12th place 上: 98 times
11th place 阪: 98 times
10th place 医: 111 times
9th place 薙: 117 times
8th place 励: 149 times
**7th place 繝: 156 times new! (the 繝 of ungen)**
6th place: 222 times
5th place 後: 224 times
4th place 九: 290 times
3rd place: 753 times
2nd place 繧: 933 times
1st place: 2944 times

The point to note is that **繝, the partner of 2nd-place 繧, is ranked 7th**. In fact, in "Kokoro" it fizzled out at 78th place with only 172 appearances, but in Melos it showed its ability to the full. 繧 must also be delighted at its partner's leap.

As is also introduced in another article, "[Why kanji with the thread radical often appear when UTF-8 is garbled into SJIS](https://qiita.com/kaityo256/items/878cbe35d4c8444b045a)", 繝 comes out when the katakana "ダヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶ" are garbled. So if you garble "メロス" (Melos), 繝 appears twice, as in "繝｡繝ｭ繧ｹ". However, the proportion of katakana in "Kokoro" is considerably lower than in present-day writing, so I cannot deny that 繝 was forced to fight at something of a disadvantage. If the Three Musketeers of the thread radical were to become the Four Heavenly Kings of the thread radical, the weakest slot would definitely go to 繝.
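The katakana claim is easy to check by brute force. The sketch below assumes Python's `cp932` codec as SJIS and uses a made-up helper, `chars_garbling_to`; it scans the Unicode Katakana block (U+30A0 to U+30FF) for characters whose first two UTF-8 bytes are the same two bytes that encode the target kanji in SJIS.

```python
def chars_garbling_to(target: str, start: int = 0x30A0, end: int = 0x30FF) -> str:
    """Return the characters in [start, end] whose first two UTF-8 bytes,
    misread as cp932 (assumed SJIS codec), come out as `target`."""
    prefix = target.encode("cp932")
    return "".join(
        chr(cp)
        for cp in range(start, end + 1)
        if chr(cp).encode("utf-8")[:2] == prefix
    )


# Katakana that produce 繝 when UTF-8 is displayed as SJIS
print(chars_garbling_to("繝"))

# The garbled form of "メロス" itself
print("メロス".encode("utf-8").decode("cp932"))
```

Because the scan covers the whole block, the first line of output also includes the few symbols that follow ヶ in Unicode, such as ・ and ー.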

In conclusion

This is a programming article ... admittedly that's a stretch ...

I'd be glad if I managed to convey that the somewhat scary kanji that show up in garbled text turn out to be surprisingly interesting once you look them up. Those hated mystery kanji are also living (or once-living) characters with literature and history behind them, so please don't bully them too much.
