Introduction

While searching for products on Amazon, I noticed that the Japanese in the descriptions of suspicious products has improved recently. I was impressed that the sellers study so diligently even though Japanese is difficult. So I decided to write a program that weakens Japanese.

What I used

I used the following.

Python 3.6.8
MeCab 0.996
mecab-ipadic-2.7.0-20070801.tar.gz: MeCab dictionary
wnjpn.db.gz: WordNet database

Problem setting

I targeted the following patterns, where non-native speakers are likely to make mistakes or find it hard to judge subtle differences in nuance.

Wrong particles

Write the part that should be in the past form in the present form

Write a word as a different homophone

1. Wrong particle

I **write an article** once every three months. / Articles that look interesting today are lined up with trends. / JavaScript **to** I personally like Python.

All of them feel slightly off. The conditions under which a particle is swapped, and the substitution pairs, are as follows.

・ A particle is swapped only when the part of speech of the preceding word is "noun".
・ The error patterns are: "が" (ga) <=> "は" (wa), "で" (de) <=> "に" (ni), "から" (kara) <=> "より" (yori).

The code looks like this.

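import numpy as np

def mistake_ppp(ppp):
    # Return the confusable counterpart of particle ppp, or ppp itself if it has none
    substitute_ppp = ppp
    # Confusable pairs: ga/wa, de/ni, kara/yori
    mistake_set = np.array([["が", "は"], ["で", "に"], ["から", "より"]])
    # Keep only the pair (row) that contains ppp
    target_pattern = mistake_set[np.any(mistake_set == ppp, axis=1)]
    if len(target_pattern) > 0:
        # Pick the other member of that pair
        the_other = target_pattern[np.where(target_pattern != ppp)]
        substitute_ppp = the_other[0]
    return substitute_ppp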
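
The "only after a noun" condition is not part of mistake_ppp itself, so here is a minimal sketch of how it might be applied to a tokenized sentence. This is not the article's actual code: the function name weaken_particles, the SWAP table, and the English part-of-speech labels are hypothetical, the tokens are written by hand instead of coming from MeCab, and the particle pairs are my reading of the three pairs in mistake_ppp (ga/wa, de/ni, kara/yori).

```python
# Hypothetical lookup table for the three confusable particle pairs.
SWAP = {"が": "は", "は": "が", "で": "に", "に": "で", "から": "より", "より": "から"}

def weaken_particles(tokens):
    """Swap a particle only when the token just before it is a noun.

    tokens: list of (surface, pos) pairs. The pos labels are plain
    English stand-ins, not real MeCab tags.
    """
    result = []
    prev_pos = None
    for surface, pos in tokens:
        if pos == "particle" and prev_pos == "noun":
            # Particles without a counterpart in SWAP are left unchanged
            surface = SWAP.get(surface, surface)
        result.append(surface)
        prev_pos = pos
    return "".join(result)

tokens = [("私", "noun"), ("は", "particle"), ("記事", "noun"),
          ("を", "particle"), ("書く", "verb")]
print(weaken_particles(tokens))  # 私が記事を書く
```

A dictionary is used here only to keep the sketch free of dependencies; it should behave the same as the NumPy lookup for these three pairs.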

2. Write the part that should be in the past form in the present form

When I write long sentences in English, I sometimes forget to keep the tense consistent in the second half, and I imagine this mistake comes from the same feeling.

The conditions and processing details for this error are as follows.

・ When the auxiliary verb "た" (ta) appears, the word before it is replaced with its base (present) form.

The code (excerpt) looks like this.

    elif pos == "助動詞" and word == "た":  # auxiliary verb "ta" (past tense)
        # Replace the previous word with its base (present) form
        word_list[word_idx - 1] = basic_list[word_idx - 1]
        xfmd_basic = basic
        # Drop the auxiliary verb itself by emitting an empty string
        xfmd_word = ""

The idea is that I keep a list of the original words (word_list) and a parallel list of their base forms (basic_list). The auxiliary verb "ta" itself is then removed by replacing it with an empty string.
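
Since the excerpt above depends on its surrounding loop, here is a self-contained sketch of the same idea. It is only an illustration under assumptions: the function name drop_past_tense and the (surface, base form, part of speech) triples are hypothetical, and the tokens are written by hand rather than produced by MeCab.

```python
def drop_past_tense(tokens):
    """Turn past-tense clauses into present tense.

    tokens: list of (surface, base_form, pos) triples. The pos labels
    are plain English stand-ins, not real MeCab tags.
    """
    word_list = [surface for surface, _, _ in tokens]
    basic_list = [base for _, base, _ in tokens]
    for idx, (surface, _, pos) in enumerate(tokens):
        if pos == "auxiliary" and surface == "た" and idx > 0:
            # Put the previous word back into its base form...
            word_list[idx - 1] = basic_list[idx - 1]
            # ...and drop the auxiliary verb itself
            word_list[idx] = ""
    return "".join(word_list)

tokens = [("記事", "記事", "noun"), ("を", "を", "particle"),
          ("書い", "書く", "verb"), ("た", "た", "auxiliary")]
print(drop_past_tense(tokens))  # 記事を書く
```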

3. Write a word as a different homophone

**Search for articles that look interesting**. / The room is **warm** because the stove is on. / Answer questions to the article.

The conditions for this error handling are as follows.

・ Only verbs, adjectives, and nouns are converted.
・ The verb "to" is not converted.
・ A homophone here means a word with the same part of speech and base form as the original, and it must be in Japanese only.

In addition, the specific processing is as follows.

1. Use WordNet to get a list of synonyms for the target word

2. Select a homophone from the list of synonyms

3. Convert it to the same inflected form as the original word and replace the original word with it

For WordNet, I downloaded and used the "Japanese Wordnet and English WordNet in an sqlite3 database" file (wnjpn.db.gz).

The code for step 2 looks like this.

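def choose_synonym(word):
    # Fall back to the original word if no suitable candidate is found
    synonym = word
    pos, basic = analyze_pos(word)
    synonyms = search_synonyms(word)
    # Visit the candidates in random order
    idxs = np.arange(0, len(synonyms), 1)
    np.random.shuffle(idxs)
    for idx in idxs:
        synonym_pos, synonym_basic = analyze_pos(synonyms[idx])
        # Accept the first candidate whose part of speech and base form match
        if synonym_pos == pos and synonym_basic == basic:
            synonym = synonyms[idx]
            break
    return synonym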

The processing and code of `search_synonyms(word)`, which performs step 1, would be long if written out here, so I covered them in a separate article (WordNet structure and synonym search). Please refer to that for details.
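
For readers who only want the rough shape of such a lookup, here is a minimal sketch. It is not the author's search_synonyms: the function name search_synonyms_sketch is hypothetical, and the table and column names (word, sense, wordid, lemma, lang, synset) are my assumption about the layout of the WordNet sqlite3 database. The demo runs against a tiny in-memory database with made-up rows so that it does not need the real file.

```python
import sqlite3

def search_synonyms_sketch(conn, word):
    """Return Japanese lemmas that share at least one synset with `word`."""
    query = """
        SELECT DISTINCT w2.lemma
        FROM word AS w1
        JOIN sense AS s1 ON s1.wordid = w1.wordid
        JOIN sense AS s2 ON s2.synset = s1.synset
        JOIN word AS w2 ON w2.wordid = s2.wordid
        WHERE w1.lemma = ? AND w2.lang = 'jpn' AND w2.lemma != ?
        ORDER BY w2.lemma
    """
    return [row[0] for row in conn.execute(query, (word, word))]

# Toy database with the assumed schema and made-up rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (wordid INTEGER PRIMARY KEY, lang TEXT, lemma TEXT, pron TEXT, pos TEXT);
    CREATE TABLE sense (synset TEXT, wordid INTEGER, lang TEXT);
    INSERT INTO word VALUES (1, 'jpn', '暖かい', NULL, 'a');
    INSERT INTO word VALUES (2, 'jpn', '温かい', NULL, 'a');
    INSERT INTO word VALUES (3, 'eng', 'warm', NULL, 'a');
    INSERT INTO sense VALUES ('toy-synset-1', 1, 'jpn');
    INSERT INTO sense VALUES ('toy-synset-1', 2, 'jpn');
    INSERT INTO sense VALUES ('toy-synset-1', 3, 'eng');
""")
print(search_synonyms_sketch(conn, "暖かい"))  # ['温かい']
```

With the real data, the connection would presumably be opened on the decompressed wnjpn.db file instead of the in-memory database.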

For step 3, MeCab's standard functions could not do this, so I used the inflection data for each part of speech that is included in the dictionary data used by MeCab (mecab-ipadic-2.7.0-20070801.tar.gz).

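def transform(pos, basic, conjugate):
    # Look up the surface form of base form `basic` in inflected form `conjugate`
    target_word = None
    dict_file = None
    # Map the part of speech to the matching dictionary CSV file
    if pos == "動詞":  # verb
        dict_file = "Verb"
    elif pos == "形容詞":  # adjective
        dict_file = "Adj"
    elif pos == "名詞":  # noun
        dict_file = "Noun"

    # Read raw bytes and decode each line as EUC-JP, ignoring undecodable bytes
    with open("../dict/pos/" + dict_file + ".csv", "rb") as f:
        words = f.readlines()
    for w in words:
        winfo = w.decode('euc_jp', errors='ignore').split(",")
        conj = winfo[9]        # inflected form
        basicform = winfo[10]  # base form
        if basicform == basic and conj == conjugate:
            target_word = winfo[0]  # surface form
            break
    return target_word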

I had a little trouble with the character encoding when reading the CSV. This is still a work in progress (there is no else branch, but that path should never be reached, so please overlook it).

Summary

The processing flow is summarized below. The names of the sub-processes are abbreviated, but I hope the correspondence is clear.

Try out

I applied it to the description (Amazon) of the recently purchased AirPods Pro.

\begin{align*}
&{\Large Mute noise to sound.}\\
&{\small The microphone detects noise on the outside and inside of the ear. The anti-noise function that balances the sound eliminates the noise before you listen.}\\
\\
&{\Large Ask only what you want to ask.}\\
&{\small If you want to listen to the surrounding Yoko and respond, switch to the external sound capture mode. Just press and hold the pressure sensor.}\\
\\
&{\Large Customized fit.}\\
&{\small Comfortable and wearable silicone ear tips are available in 3 sizes. Ventilation holes in the eartips even out the pressure on both sides of the earbud.}\\
\\
&{\Large Experience for the first time.}\\
&{\small Dedicated speaker driver, high dynamic range amplifier, H1 chip are connected. It produces a high-quality sound that cannot be imagined than a compact body.}\\
\\
&{\Large setting and Siri. Everything is simple.}\\
&{\small Very easy to connect to iPhone. You can share the song with two sets of AirPods, or have Siri read the message you receive.}\\
\\
&{\Large charging is wireless. The fun is endless.}\\
&{\small With the Wireless Charging Case, you can go anywhere with your AirPods Pro. It is compatible with Qi standard chargers.}
\end{align*}

It came out sounding somewhat suspicious. I might not have bought it if the description had read like this.

Conclusion

This program itself is not useful, but I think the reverse, removing the characteristics of weak Japanese, could be useful for Japanese language education in the future. By learning and clustering the tendencies in sentences written by people studying Japanese, it may be possible to find the mistakes each group is likely to make and to create a Japanese language curriculum tailored to each of them.

[PYTHON] Program to weaken Japanese