[PYTHON] Don't use such strong words... it makes you look weak ~ Let's use AI to check whether a sentence has been paraphrased "easily" ~

I wrote "AI" in the title, but please read between the lines. Please…

I think engineers are a kind of nerd. Have you fellow nerds ever put ordinary people off by talking to them in technical terms? I have.

So, this time: does my wording actually get across? Is there a more understandable way to say it? This is an attempt to judge that with the help of machine learning.

TL;DR

I proposed and implemented a system that judges whether a sentence has been paraphrased in a way that is both easy and meaning-preserving. It combines semantic equivalence analysis by BERT + a CNN head with word difficulty judgment based on a textbook corpus.

GitHub: https://github.com/MosasoM/Aizen I have put the source there in a form that is easy to try, so I would be happy if you gave it a go.

| Standard text | Paraphrase text | Score | Implication probability | Difficulty |
|---|---|---|---|---|
| Please summarize the agenda in paper form for the next meeting | Summarize the agenda in a resume for the next meeting | 0.08641 | 0.744 | 9 |
| Please summarize the agenda in paper form for the next meeting | Put together a piece of paper to discuss for the next meeting | inf | 0.994 | 0 |

In this case, "Put together a piece of paper to discuss for the next meeting" is judged to be the more appropriate, easier paraphrase.

I want this to be an opportunity to review my own communication, which tends to rely on difficult words. I also think it is valuable to stop and ask, "What does this word actually mean again?"

There are many rough spots, but I would be glad if this could serve as a starting point for making something better.

Introduction

You can skip this part.

I think everyone uses technical terms at work to some degree. Jargon compresses a lot of meaning into few words, so it is certainly very useful among people who share the same knowledge.

It is also well known that jargon can be very unpleasant for someone who does not know it. Even so, we get captivated by how much information a technical term carries, and we use such terms heavily without checking how well the other person understands them (checking everything the other person knows as you go is unrealistic in the first place). I think many people have had communication fail this way.

This seems to happen more often to people with a large everyday vocabulary, the people who are called "excellent". It is a loss for society as a whole when an excellent person loses their reputation through this kind of communication failure. This is one place where easy paraphrasing is needed. There are of course many others, such as explaining something to children or teaching language beginners. Trying to paraphrase a sentence in easier words also benefits the speaker, not only the listener: reconfirming the definition of a term can deepen one's own thinking.

Based on the above background, this article summarizes what I did with the aim of determining whether a sentence is an easy paraphrase of a standard sentence and building a system that gives feedback.

↑ I don't understand difficult words! Say it more simply!!!

In other words, I will build a system that judges whether a sentence is a simple paraphrase and gives feedback.

Since this is a prototype, there are many sloppy points. The ones I am aware of are noted in each chapter under the label "Gabber Point".

Method outline

The sentence has to be both "easy" and a "paraphrase", so I build one judge for "easiness" and one for "paraphrase". For easiness I simply use the difficulty level of the words in the sentence, and for the paraphrase judgment I use an implication (textual entailment) judgment task solved with BERT + CNN.

model_sumary_aizen.png

Gabber Point 1 It would be better if the system could judge not only individual words but also phrase-level expressions. The implication judgment task has its own sloppy points, which are described later.

Method 1: Judging easiness

Morphological analysis is done with JUMAN++. For each independent word (noun, verb, adjective, adverb) in the result, the difficulty level is taken from the grade in which the word first appears according to the Textbook Corpus Vocabulary Table. (Thanks to the Kurohashi / Kawahara Laboratory and the National Institute for Japanese Language and Linguistics.) (I borrowed the vocabulary table on the understanding that this is educational and research use; if that is a problem, I will switch to another one.)

This time, independent words that do not appear in the textbook corpus are not counted. Sa-verbs (noun + suru) like ~ could not be found in the table, and words too difficult to appear in the textbook corpus are rarely used in an easy paraphrase anyway, so I prioritized ease of implementation.

On a Mac, JUMAN++ can currently be installed with

brew install jumanpp

This time I use it through the Python binding, so pyknp is needed as well. It can be installed with pip:

!pip install pyknp

That is all. Older articles download each package with wget, but this is enough now. Next, build the dictionary for difficulty judgment. It is not hard: just process the vocabulary table a little and hold it as a dict.


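import pandas as pd

# Load the textbook corpus vocabulary table (tab-separated).
vocab = pd.read_csv("./textbook.txt",encoding="utf8",sep="\t")
vocab = vocab[["Lexeme","Lexeme読み","First grade"]]
# Map the grade of first appearance to a numeric difficulty (lower = easier).
# The keys must match the labels actually used in the vocabulary table.
grade_dif_dic = {"small_Before":0,"small_rear":1,"During ~":2,"High":3}
vocab_diff_dic = {}
for line in vocab.values:
    lexeme, level = line[0], grade_dif_dic[line[2]]
    # Keep the lowest (earliest) level when a headword appears more than once.
    if lexeme not in vocab_diff_dic or level < vocab_diff_dic[lexeme]:
        vocab_diff_dic[lexeme] = level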

That is all it takes to build the dictionary. Please forgive me for not doing it more elegantly with pandas map. It should be rare, but when a headword appears more than once, the entry with the earliest grade of first appearance is adopted.

Morphological analysis is done almost exactly as in the reference example.

Of the resulting morphemes, only those whose part of speech is an independent word are looked up in the dictionary, and their difficulty levels are summed. Concretely:
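As a rough illustration of the pandas-based alternative mentioned above, the same "earliest grade per headword" dictionary can be built with `Series.map` and `groupby`. This is only a sketch on a toy table: the column names (`lexeme`, `first_grade`), the grade labels, and the variable names are hypothetical placeholders, not the values in the real vocabulary table.

```python
import pandas as pd

# Toy stand-in for the vocabulary table. Column names and grade labels
# are hypothetical and do not come from the real file.
toy_vocab = pd.DataFrame({
    "lexeme": ["犬", "会議", "犬", "議題"],
    "first_grade": ["elementary_late", "junior_high", "elementary_early", "high"],
})

# Map each grade label to a numeric difficulty (lower = appears earlier).
grade_to_level = {
    "elementary_early": 0,
    "elementary_late": 1,
    "junior_high": 2,
    "high": 3,
}

# Convert labels to numbers, then keep the smallest level per headword.
levels = toy_vocab["first_grade"].map(grade_to_level)
toy_diff_dic = levels.groupby(toy_vocab["lexeme"]).min().to_dict()

print(toy_diff_dic)
```

Here the duplicated headword "犬" ends up with level 0, the earlier of its two grades, which is the same rule the loop-based version applies.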

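from pyknp import Juman

jumanpp = Juman()
# Parts of speech treated as independent words (noun, verb, adjective, adverb).
# This definition is not shown in the article and is assumed here.
indep = {"名詞", "動詞", "形容詞", "副詞"}

test1 = "This is a trial text and is different from the text used in this experiment."
result = jumanpp.analysis(test1)
d1 = 0      # total difficulty
d1_cnt = 0  # number of independent words found in the dictionary
for mrph in result.mrph_list(): # access each morpheme
    if mrph.hinsi in indep:
        if mrph.genkei in vocab_diff_dic:
            d1 += vocab_diff_dic[mrph.genkei]
            d1_cnt += 1
            print(mrph.genkei,vocab_diff_dic[mrph.genkei])
print("Total difficulty:{} Number of independent words:{} Average difficulty:{}".format(d1,d1_cnt,d1/d1_cnt if d1_cnt else 0))

Output:

Total difficulty:4 Number of independent words:6 Average difficulty:0.6666666666666666

That is how it works.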

Method 2: Analysis of semantic equivalence

This is the sloppiest part of the whole system, so please be prepared.

2-1: Data preparation

BERT is used essentially as a preprocessing step for the data, so this chapter only describes where the data came from. The goal is to judge whether two sentences have the same meaning, so I use datasets for implication (textual entailment) judgment, which is a similar task. Because I also want the model to handle easy paraphrases, I used the following data.

  • Nagaoka University of Technology Natural Language Laboratory Easy Japanese Corpus
  • NTCIR-10 RITE2
  • [Kyoto University Kurohashi / Kawahara Laboratory Textual Entailment Evaluation Data](http://nlp.ist.i.kyoto-u.ac.jp/index.php?Textual%20Entailment%20%E8%A9%95%E4%BE%A1%E3%83%87%E3%83%BC%E3%82%BF)

Of these, the Easy Japanese Corpus will be taken offline at the end of March 2020 because the laboratory is closing, so it seems better to download it early.

Thank you to the authors of each dataset.

If the data is simply combined, the result is imbalanced: the Easy Japanese dataset contains only implication pairs and is large, so implication examples far outnumber the rest, which is troublesome.

Therefore, as a final step, two different sentences were picked at random from the above data and paired up to inflate the non-implication examples until the data was balanced.

Code to create the above data:

import xml.etree.ElementTree as ET

import numpy as np
import pandas as pd

def data_load():
    train_x = []
    train_y = []
    
    true_count = 0
    
    entail_kyoto = ET.parse("/content/drive/My Drive/Aizen/Datas/entail_evaluation_set_label_conv.xml") #Kyoto University Corpus
    entail_kyoto_root = entail_kyoto.getroot()
    for child in entail_kyoto_root:
        temp = []
        if child.attrib["label"] == "Y":
            train_y.append(1)
            true_count += 1
        else:
            train_y.append(0)
        for gchild in child:
            temp.append(gchild.text)
        train_x.append(temp)
    RITE_names = ["dev_bc","dev_mc","testlabel_bc","testlabel_mc"]
    pos = ["F","B","Y"]
    neg = ["C","I","N"]
    for name in RITE_names: #Load RITE. Read each test and divide later
        rite_file_path = "/content/drive/My Drive/Aizen/Datas/RITE2_JA_bc-mc-unittest_forOpenAccess/RITE2_JA_{}/RITE2_JA_{}.xml".format(name,name)
        rite_tree = ET.parse(rite_file_path)
        root = rite_tree.getroot()
        for child in root:
            temp = []
            if child.attrib["label"] in pos:
                train_y.append(1)
                true_count += 1
            else:
                train_y.append(0)
            for gchild in child:
                temp.append(gchild.text)
            train_x.append(temp)
    easy_jp = pd.read_csv("/content/drive/My Drive/Aizen/Datas/T15-2020.1.7.csv").values #Easy Japanese reading.
    for line in easy_jp:
        if line[1] != line[2]: #Only the one that changed in other words
            train_y.append(1)
            train_x.append([line[1],line[2]])
            true_count += 1
    
    # From here, pair up two random sentences and add them as incorrect (non-implication) examples
    # until the numbers of correct and incorrect examples are equal.
    all_num = len(train_x)
    
    for i in range(2*true_count-all_num):
        left_row = np.random.randint(all_num)
        left_col = np.random.randint(2)
        right_row = np.random.randint(all_num)
        while left_row == right_row: # the two sentences must come from different pairs
            right_row = np.random.randint(all_num)
        right_col = np.random.randint(2)
        train_x.append([train_x[left_row][left_col],train_x[right_row][right_col]])
        train_y.append(0)
    
    
    
    return train_x,train_y
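
The balancing step can be tried in isolation on a toy dataset. The following is a minimal sketch, not the article's code: the function name `balance_with_random_pairs`, the seed, and the toy sentences are all made up for illustration, and it assumes that two sentences drawn from different rows are unrelated, which a real dataset does not guarantee.

```python
import random

def balance_with_random_pairs(pairs, labels, seed=0):
    """Append randomly mixed sentence pairs as negatives until both classes match in size."""
    rng = random.Random(seed)
    pairs = list(pairs)
    labels = list(labels)
    n_original = len(pairs)
    n_pos = sum(labels)
    n_neg = n_original - n_pos
    # Number of negatives still missing; nothing is added if negatives already dominate.
    for _ in range(max(0, n_pos - n_neg)):
        # Pick two different rows among the original pairs.
        left_row, right_row = rng.sample(range(n_original), 2)
        left = pairs[left_row][rng.randrange(2)]
        right = pairs[right_row][rng.randrange(2)]
        pairs.append([left, right])
        labels.append(0)
    return pairs, labels

toy_pairs = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"], ["d1", "d2"]]
toy_labels = [1, 1, 1, 0]
balanced_pairs, balanced_labels = balance_with_random_pairs(toy_pairs, toy_labels)
print(len(balanced_pairs), sum(balanced_labels))
```

With three positives and one negative, two synthetic negatives are appended, giving three examples of each class.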

Gabber Point 2
The implication datasets basically contain only A → B or A = B, so it may have been better for this purpose to prepare A [...]. As a result, the model may well generalize poorly as a general-purpose implication judgment model. This time, however, it only needs to judge whether an easy paraphrase is implied, and the data contains enough easy paraphrases for that purpose, so I consider it acceptable.

2-2: Model

For the model I use BERT, a natural language processing model that has been a hot topic recently (or rather, a little while ago?). I did not really understand what BERT was, so I referred to the following sites. https://udemy.benesse.co.jp/ai/bert.html https://ainow.ai/2019/05/21/167211/ http://deeplearning.hatenablog.com/entry/transformer

The model used this time is BERT + CNN. Training the BERT part myself would be quite hard, and while I was wondering what to do, I found the following, for which I am very grateful. https://yoheikikuta.github.io/bert-japanese/ A godsend. Thank you very much.

But since this model uses SentencePiece, do I have to prepare the input data myself...? https://qiita.com/hideki/items/1ec1c21c33326ad5615f

Eh, isn't this too convenient? Thank you very much. Thanks to the accumulated work of these great predecessors, I can obtain vectors from the monster model BERT in no time.

So I only build the head. As a poor student, I could not prepare enough computing resources to fine-tune BERT itself, so only the head is trained, on Google Colaboratory. (Thanks, Google.)

BERT returns one vector per word. The vectors for two sentences are fed in, and the problem is solved as a two-class classification: same meaning or different meaning.

The one small twist is that the input is built by inserting the vectors of the two sentences alternately. My thinking was that this keeps words at similar positions in the two sentences next to each other, which should carry more useful position information than simply placing the two sentences side by side.

BERT_con.png
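To make the alternating layout concrete, here is a minimal NumPy sketch of token-wise interleaving. It assumes each sentence has already been turned into a padded array of shape (128, 768), that is, (tokens, embedding size). The function name `interleave_tokens` is made up for this example; the article's own generator uses `np.insert` and may lay out the axes differently.

```python
import numpy as np

def interleave_tokens(left, right):
    """Interleave two (tokens, dim) arrays row by row: l0, r0, l1, r1, ..."""
    if left.shape != right.shape:
        raise ValueError("both inputs must have the same shape")
    tokens, dim = left.shape
    out = np.empty((2 * tokens, dim), dtype=left.dtype)
    out[0::2] = left   # even rows come from the first sentence
    out[1::2] = right  # odd rows come from the second sentence
    return out

# Dummy "sentence vectors": all zeros for one sentence, all ones for the other.
left = np.zeros((128, 768), dtype=np.float32)
right = np.ones((128, 768), dtype=np.float32)
mixed = interleave_tokens(left, right)
print(mixed.shape)
```

The result has 256 rows, matching the first dimension of the CNN input, and a small convolution window sliding over it sees the same token position from both sentences at once.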

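The interleaved vectors are classified by a CNN. In Keras, the CNN is configured as follows.

from keras.layers import (Activation, BatchNormalization, Conv2D, Dense,
                          GlobalAveragePooling2D, Input, MaxPooling2D)
from keras.models import Model

def make_network():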
    x_in = Input((256,768,1))
    x = Conv2D(64,(3,3),padding="valid")(x_in)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(64,(3,3),padding="valid")(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = MaxPooling2D(pool_size=(3,3),strides=2,padding="same")(x)

    x = Conv2D(64,(2,2),padding="valid")(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(128,(2,2),padding="valid")(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = MaxPooling2D(pool_size=(3,3),strides=2,padding="same")(x)

    x = GlobalAveragePooling2D()(x)

    x = Dense(2)(x)
    x = Activation("softmax")(x)
    model = Model(x_in,x)
    return model

Well, it is just a plain CNN with nothing unusual about it.
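To see how the feature map shrinks through this network, the layer sizes can be traced by hand. The sketch below uses the standard size rules for stride-1 `padding="valid"` convolutions and `padding="same"` pooling; the helper names are made up for this example.

```python
import math

def conv_valid(size, kernel):
    """Output length of a stride-1 convolution with padding='valid'."""
    return size - kernel + 1

def pool_same(size, stride):
    """Output length of a pooling layer with padding='same'."""
    return math.ceil(size / stride)

h, w = 256, 768
for kernel in (3, 3):               # two 3x3 convolutions
    h, w = conv_valid(h, kernel), conv_valid(w, kernel)
h, w = pool_same(h, 2), pool_same(w, 2)
for kernel in (2, 2):               # two 2x2 convolutions
    h, w = conv_valid(h, kernel), conv_valid(w, kernel)
h, w = pool_same(h, 2), pool_same(w, 2)
print(h, w)
```

Under these rules the last pooling layer leaves a 62 x 190 map with 128 channels, which global average pooling reduces to a 128-dimensional vector before the two-class dense layer.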

Gabber Point 3 If only the horizontal direction matters, I suspect that a filter such as 1 * 5 instead of 3 * 3 would see relatively farther while also reducing the number of parameters.

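The parameter claim can be checked with simple arithmetic. A standard 2D convolution has kernel height × kernel width × input channels × output channels weights plus one bias per output channel. The sketch below compares 3 * 3 and 1 * 5 kernels for the channel sizes of the first two convolution layers; the function name is made up for this example.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel for a standard 2D convolution."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

for in_ch, out_ch in [(1, 64), (64, 64)]:
    square = conv2d_params(3, 3, in_ch, out_ch)
    wide = conv2d_params(1, 5, in_ch, out_ch)
    print(in_ch, out_ch, square, wide)
```

A 1 * 5 kernel has 5 weights per channel pair instead of 9, so the count drops from 640 to 384 for the first layer and from 36928 to 20544 for the second, while the kernel covers five positions along one axis instead of three.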
The generator that feeds the above data into the model looks like this.

def data_gen(data_X,data_y,vectorian,batchsize=32):
    d_cnt = 0
    x = []
    y = []
    while True:
        for i in range(len(data_X)):
            t_l = str(data_X[i][0])
            t_r = str(data_X[i][1])

            t_l = vectorian.fit(t_l).vectors
            t_r = vectorian.fit(t_r).vectors


            itp = range(t_l.shape[1])
            in_x = np.insert(t_r,itp,t_l[:,itp],axis=1)
            
            x.append(in_x)
            
            if data_y[i] == 1:
                y.append([0,1])
            else:
                y.append([1,0])
            
            d_cnt += 1
            
            if d_cnt == batchsize:
                inputs = np.array(x).astype(np.float32)
                inputs = inputs.reshape(batchsize,256,768,1)
                targets = np.array(y)
                x = []
                y = []
                d_cnt = 0
                yield inputs,targets

Well, nothing exciting there. So I started training.

One epoch took about an hour, so training on Google Colaboratory was painful. The accuracy for epochs 3 to 5 is shown below. (The output for epochs 1 and 2 was lost when I hit the GPU limit partway through and had to start a new runtime.)

Epoch 1/1
1955/1955 [==============================] - 3846s 2s/step - loss: 0.1723 - categorical_accuracy: 0.9396 - val_loss: 0.1632 - val_categorical_accuracy: 0.9398
Epoch 1/1
1955/1955 [==============================] - 3852s 2s/step - loss: 0.1619 - categorical_accuracy: 0.9436 - val_loss: 0.1477 - val_categorical_accuracy: 0.9487
Epoch 1/1
1955/1955 [==============================] - 3835s 2s/step - loss: 0.1532 - categorical_accuracy: 0.9466 - val_loss: 0.1462 - val_categorical_accuracy: 0.9482

The loss looks like it would still go down, but staying glued to the PC is hard, so please forgive me... By the way, although I forced it, the data is balanced, so I will not calculate F1 ~~(too much trouble)~~

With balanced data, I think accuracy alone will not give a particularly misleading picture.

Output

With the model and the dictionary above, you can get a paraphrase judgment and a difficulty score for a sentence. Prepare an output function as needed.
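As a rough check of this intuition, accuracy and F1 can be compared on made-up confusion-matrix counts. None of the numbers below come from the article's model; they are hypothetical and only illustrate how the two metrics behave on balanced versus imbalanced data.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 of the positive class from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Hypothetical balanced case: 500 positives, 500 negatives, symmetric errors.
balanced = accuracy_and_f1(tp=470, fp=30, fn=30, tn=470)
# Hypothetical imbalanced case: 50 positives, 950 negatives.
skewed = accuracy_and_f1(tp=10, fp=10, fn=40, tn=940)
print(balanced, skewed)
```

In the balanced case both metrics come out at 0.94, while in the imbalanced case accuracy is 0.95 but F1 is below 0.3. Note that even on balanced data the two can drift apart if the errors are strongly one-sided, so this supports the shortcut rather than proving it.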

def texts_to_inputs(text1,text2):
    t_l = text1
    t_r = text2

    t_l = vectorian.fit(t_l).vectors
    t_r = vectorian.fit(t_r).vectors

    itp = range(t_l.shape[1])
    in_x = np.insert(t_r,itp,t_l[:,itp],axis=1).reshape(1,256,768,1)
    return in_x

def easy_trans_scores(test1,test2,vocab_diff_dic):
    # Implication probability from the BERT + CNN model.
    test_x = texts_to_inputs(test1,test2)
    p = model.predict(test_x)

    # Total difficulty of the independent words in test1.
    result = jumanpp.analysis(test1)
    d1 = 0
    d1_cnt = 0
    for mrph in result.mrph_list(): # access each morpheme
        if mrph.hinsi in indep:
            if mrph.genkei in vocab_diff_dic:
                d1 += vocab_diff_dic[mrph.genkei]
                d1_cnt += 1
    avg = d1/d1_cnt if d1_cnt else 0 # avoid ZeroDivisionError when no word is in the dictionary
    print("Implication probability:{:.3g} Total difficulty:{} Number of independent words:{} Average difficulty:{}".format(p[0][1],d1,d1_cnt,avg))
    # p[0][1] is a NumPy float, so a total difficulty of 0 yields inf rather than an exception.
    print("total score:{:.5g}".format(p[0][1]/d1))

For now, the score is implication probability / (total difficulty). I use the total rather than the average difficulty because I wanted short, concise sentences to score well. There is, however, a bug: when the total difficulty is 0, the score is always inf, even if there is no implication relationship... I let it slide this time.
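One possible way to guard against the inf case is sketched below. This is not what the article's code does: the function name `paraphrase_score` and the `min_prob` threshold of 0.5 are assumptions introduced only for illustration.

```python
import math

def paraphrase_score(entail_prob, total_difficulty, min_prob=0.5):
    """Score = implication probability / total difficulty, with a guard for weak implication."""
    # Assumed rule: below the threshold the sentence is not treated as a paraphrase at all.
    if entail_prob < min_prob:
        return 0.0
    # A paraphrase with no scored words keeps the article's behaviour (infinite score).
    if total_difficulty == 0:
        return math.inf
    return entail_prob / total_difficulty

print(paraphrase_score(0.974, 14))
print(paraphrase_score(0.994, 0))
print(paraphrase_score(0.2, 0))
```

With this guard, a sentence whose words all have difficulty 0 only reaches inf when the model also considers it a paraphrase. Where to put the threshold is a design choice that would need tuning.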

Result

| Standard text | Paraphrase text | Score | Implication probability | Difficulty |
|---|---|---|---|---|
| This is a trial text and is different from the text used in this experiment. | **This is a trial sentence** | 0.88207 | 0.882 | 1 |
| This is a trial text and is different from the text used in this experiment. | This text is prepared for the test only and is different from the one actually used. | 0.20793 | 0.832 | 4 |
| This is a trial text and is different from the text used in this experiment. | This is a test | 0.53867 | 0.539 | 1 |
| Please summarize the agenda in paper form for the next meeting | Summarize the agenda in a resume for the next meeting | 0.086415 | 0.744 | 9 |
| Please summarize the agenda in paper form for the next meeting | **Put together a piece of paper to discuss for the next meeting** | inf | 0.994 | 0 |
| Complete hypnosis that controls the five senses and spiritual pressure perception of the other person who has seen the moment of liberation even once, and can mislead the subject. | If you see the moment of the solution even once, you will surely misunderstand the object. | 0.063651 | 0.287 | 5 |
| Complete hypnosis that controls the five senses and spiritual pressure perception of the other person who has seen the moment of liberation even once, and can mislead the subject. | **You can make the subject feel wrong by manipulating the person who saw the moment of the solution even once.** | 0.11004 | 0.654 | 7 |

For each standard text, the paraphrase judged to have the highest score is in bold. Isn't this a reasonably convincing result?

By the way, Mr. Aizen... please use some easier words...

Result 2 (Addendum): The paraphrasing championship and the power of Kim Roger

Using this system, I played a paraphrasing championship with a friend (a one-on-one game; the winner becomes king). This gives an idea of what it would look like as a game.

Here is the theme sentence.

**The man who gained wealth, fame, power, and everything in this world, "Gold Roger, the Pirate King". The words he gave at his death drove people to the sea.**

This is the opening narration of a certain national anime that everyone knows.

First player (me), 1st turn

**Money, position, power. The man who got everything in this world, Gold Roger, the pirate king. The words he said before he died made people want to go to the sea**

Score: Implication probability: 0.696 Total difficulty: 8 **total score: 0.087022**

Well, I think it is a good first move. The implication probability is a little low, so I feel I pushed too hard to lower the difficulty.

Second player (opponent), 1st turn

**"The man who got all the money, popularity, power, and the world, 'Pirate King Gold Roger'. The word he said just before his death sent people to the sea"**

Score: Implication probability: 0.974 Total difficulty: 14 **total score: 0.069588**

I lose on implication probability, but I am still ahead on total score. This is what happens if you do not take risks and actually swap out words. It is surprisingly strategic.

First player (me), 2nd turn

**Money, popularity, power, the man who got everything in the world, 'Pirate King Gold Roger'. The words he said before he died made people want to go to the sea**

Score: Implication probability: 0.917 Total difficulty: 9 **total score: 0.10194**

Since my opponent had a higher implication probability, I used their sentence as a model and greatly improved both my implication probability and my total score. I have **won** this (flag)

Second player (opponent), 2nd turn (winning sentence)

**A man who got all of money, popularity, power, and the world, 'Pirate King Kim Roger'. The word he said when he died sent everyone to the sea**

Score: Implication probability: 0.952 Total difficulty: 9 **total score: 0.10581**

Eh...?

I lost. It was close, but a loss is a loss. What decided it was probably **Kim Roger**: it greatly lowers the difficulty while keeping the implication probability. (I burst out laughing when I saw the result.)

Well, being able to hold a fun paraphrasing championship like this is also one of the interesting parts of this system. ~~Because it is sloppy here and there,~~ how you interpret the system is up to you: you can compete seriously for points, meta-gaming included, or compete within the bounds of common sense, and each has its own fun.

Summary and future prospects

Although there were some sloppy and loose parts, I was able to judge, more or less, whether a sentence is an easy paraphrase by combining the implication judgment and the difficulty judgment.

**This is quite fun even to play alone.** (Choosing the prompt is fun too: "Can this be paraphrased well?" or "Isn't this sentence difficult?") So I would like to turn this into a game or a service, with an online ranking and so on. (I cannot do it right now because I lack the knowledge to build a service.)

If people enjoy this as a game, it could at the same time produce a dataset of human-made easy paraphrases, and I think that data would allow more interesting analyses.

This model etc. (for those who want to try it)

https://github.com/MosasoM/Aizen

It is available there. As also written in the repository, you will need the following:

  • JUMAN++
  • text-vectorian
  • BERT Japanese pre-trained model (https://yoheikikuta.github.io/bert-japanese/)

If you look at Aizen.ipynb, you can (probably) try it out quite easily. If you get any interesting results, or if you like this scoring system, I would appreciate a comment.

Acknowledgments

This article relies on borrowed BERT pre-trained models and various datasets. As mentioned in the text, my thanks go to everyone who built and maintains them.
