[Python] Topic extraction of Japanese text 2: Practical edition

Aidemy 2020/10/30


Hello, this is Yope! I am a liberal-arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that so many people read my previous summary article. Thank you! This is my second post on topic extraction of Japanese text. Nice to meet you.

What you will learn this time
・Implementation of an answer sentence selection system

Answer sentence selection system

・An __answer sentence selection system__ is a system that, given a question sentence and multiple candidate answer sentences, automatically selects the correct answer from among them.
・For the dataset, use Textbook Question Answering, with "train.json" as the training data and "val.json" as the evaluation data.

Data preprocessing

・When using deep learning in natural language processing, raw data cannot be handled as it is. As we learned in "Natural Language Processing", sentences must first be split into words (word segmentation). This time, we also assign an ID to each word.
・In addition, since matrix calculations cannot be performed when input sentences have different lengths, it is also necessary to __unify the length of the input sentences__. This is called __padding__. Details are described later, but short sentences are padded with 0 and long sentences are truncated.

Normalization / word-separation

・As data preprocessing, normalization and word segmentation are performed first. Since the data is in English this time, both are done for English text.
・English normalization here uses the method of __unifying letter case__: this time, the processing __"converts everything to lowercase"__. You can lowercase English text with the __lower()__ method.
・For English word segmentation, use a tool called __nltk__. If you pass a normalized sentence to nltk's __word_tokenize()__, it __returns a list split into words__.

・Code (the result is ['earth', 'science', 'is', 'the', 'study', 'of']): スクリーンショット 2020-10-25 21.42.52.png
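For reference, the preprocessing described above can be sketched as below. The article uses nltk's `word_tokenize()`; here a simple regex stands in for it so the sketch runs without downloading nltk data.

```python
import re

def preprocess(s):
    """Lowercase the sentence (normalization), then split it into words.
    A regex tokenizer stands in for nltk.word_tokenize() here."""
    return re.findall(r"[a-z0-9]+", s.lower())

print(preprocess("Earth science is the study of"))
# → ['earth', 'science', 'is', 'the', 'study', 'of']
```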

Word ID

・Words themselves cannot be fed into a neural network as input, so each word must be given an __ID__.
・If IDs were given to all words, the amount of data would become too large, so only words that appear with at least a certain frequency are converted to __IDs__.

・The actual code is as follows. A detailed explanation follows below. スクリーンショット 2020-10-25 21.59.33.png

・About the above code: __preprocess(s)__ is the function that performs the normalization and word segmentation from the previous section. Below it, __preprocess()__ is applied to the question text (['question']) and the answer texts (['answerChoices']) of the train data, and the results (lists of words) are stored in the list "sentences". For each word (w) in each segmented list (s) in sentences, its frequency is counted with __vocab.get()__ using w as the key and stored in the dictionary "vocab". Next, an empty dictionary "word2id" is prepared, and for each key (w) and value (v) of vocab, if w does not yet exist in word2id and its frequency v is 2 or more, an ID is assigned with __len(word2id)__. Words whose frequency is 1 are all given the ID 0. The "target" part at the bottom actually normalizes and segments 'question' and looks up the ID of each word in "word2id".
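The flow described above can be sketched as follows. The sentences are toy data standing in for the preprocessed train data, and an explicit counter replaces the article's `len(word2id)` trick so that the dummy ID 0 stays reserved for low-frequency words:

```python
# toy segmented sentences standing in for the preprocessed train data
sentences = [['earth', 'science', 'is', 'the', 'study', 'of', 'earth'],
             ['the', 'study', 'of', 'space']]

# count word frequencies with dict.get()
vocab = {}
for s in sentences:
    for w in s:
        vocab[w] = vocab.get(w, 0) + 1

# assign IDs only to words appearing at least twice;
# rarer words (and padding) share the dummy ID 0
word2id = {}
next_id = 1
for w, v in vocab.items():
    if v >= 2:
        word2id[w] = next_id
        next_id += 1
    else:
        word2id[w] = 0

# convert a question into a sequence of IDs (unknown words fall back to 0)
target = [word2id.get(w, 0) for w in ['the', 'study', 'of', 'mars']]
```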

Padding

・As the last step of data preprocessing, __padding__, which unifies the length of the sentences, is performed. Specifically, for short sentences the dummy ID 0 is __appended as many times as necessary__, and for long sentences __as many words as necessary are deleted__.
・To execute padding, use keras's __pad_sequences(arguments)__. Pass the data as the first argument. The remaining arguments are as follows.
・maxlen: maximum length
・dtype: data type ("np.int32" in this case)
・padding: specify 'pre' or 'post'; whether padding is added at the front or the back of the sentence
・truncating: same as padding, but for deleting words
・value: the value used for padding (0 this time)
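The behavior of `pad_sequences()` can be illustrated with a small pure-Python stand-in (note that keras itself defaults to `padding='pre'`; 'post' is shown here to match the description above):

```python
def pad_like(seqs, maxlen, padding='post', truncating='post', value=0):
    """Minimal sketch of keras pad_sequences: pad short sequences with
    `value` and truncate long ones to `maxlen`."""
    result = []
    for s in seqs:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops words from the front, 'post' from the back
            s = s[-maxlen:] if truncating == 'pre' else s[:maxlen]
        pad = [value] * (maxlen - len(s))
        result.append(pad + s if padding == 'pre' else s + pad)
    return result

print(pad_like([[1, 2], [1, 2, 3, 4]], maxlen=3))
# → [[1, 2, 0], [1, 2, 3]]
```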


Model building

Overall picture

-Use __ "Attention-based QA-LSTM" __ for the model. The procedure is as follows. (1) Implement __BiLSTM for each of Question and Answer __. (2) From Question, Attention to Answer and acquire Answer information considering Question. (3) From the Question and Answer after Attention, __ calculate the mean (mean_pooling) of the hidden state vectors at each time __. ④ Output __ by combining the two vectors of ③.

・Figure![Screenshot 2020-10-30 13.08.44.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/17586dd7-fb8e-eccd-11f6-490c8177080c.png)

① Implement BiLSTM for Question and Answer

・As seen in Chapter 1, a BiLSTM is a __"bidirectional recurrent neural network"__ that reads the input from both directions. It is implemented with __Bidirectional(arguments)__.
・Set the input layer of the Question as "input1", apply Embedding, and wrap the LSTM with __Bidirectional()__. Similarly for the Answer, implement a BiLSTM with "input2" as the input layer.

・Code![Screenshot 2020-10-25 23.38.26.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f8543aa5-8428-9c29-7e4b-d07a70fcd345.png)

② Attention from Question to Answer

・Using Attention, the machine can "judge whether the Answer is valid as an answer to the Question". That is, by computing the features of the Answer while taking the Question's hidden state vectors at each time step into account, you can __obtain Answer information that reflects the Question__.
・Compute the matrix product with __dot()__ of "h1" and "h2", the dropout-applied outputs of the two sentences' BiLSTMs "bilstm1" and "bilstm2"; apply the __softmax function__ to it; then __compute the matrix product of the result and h1__. Connect this and h2 with __concatenate()__ and pass it through a Dense layer.
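The computation described above (matrix product → softmax → matrix product → concatenation) can be sketched in NumPy for a single pair of sentences; the shapes and hidden-state values are hypothetical placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
len_q, len_a, dim = 5, 3, 4
h1 = rng.normal(size=(len_q, dim))  # question hidden states (BiLSTM output)
h2 = rng.normal(size=(len_a, dim))  # answer hidden states

scores = h2 @ h1.T                  # matrix product: (len_a, len_q)
attn = softmax(scores, axis=-1)     # each answer word's weights over the question
h1_weighted = attn @ h1             # question info reflecting attention: (len_a, dim)
# concatenated features, which the model then feeds to a Dense layer
feature = np.concatenate([h1_weighted, h2], axis=-1)
```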

・Code![Screenshot 2020-10-30 13.09.17.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/228e3f25-c90b-1321-4be7-744ecf59b6c9.png)

③ Calculate the mean (mean_pooling) of the hidden state vectors at each time

・From the Question, and from the Answer after Attention, __calculate the average of the hidden state vectors at each time step__. This average is called __"mean_pooling"__.
・Mean pooling is executed with keras's __AveragePooling1D(arguments)(x)__. The arguments are:
・pool_size: the length (number of time steps) of the data x to be pooled
・strides: an integer or None
・padding: specify 'valid' or 'same'
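Mean pooling itself is just an average over the time axis; `AveragePooling1D` with `pool_size` equal to the sequence length does this in one step. A NumPy sketch with toy values:

```python
import numpy as np

# hidden state vectors at 3 time steps, 2 features each (toy values)
h = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

mean_pooled = h.mean(axis=0)  # average over time → one vector per sentence
print(mean_pooled)            # → [3. 4.]

# In keras, AveragePooling1D(pool_size=3) applied to a (batch, 3, 2) tensor
# yields shape (batch, 1, 2), which is why the article's code needs a Reshape.
```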

・In the code (shown later), AveragePooling1D is first applied to "h1" (the Question output) and "h" (the Answer output) created in the previous sections. These are then reshaped so that they can be combined in ④.

④ Output by combining the two vectors of ③

・Finally, combine "mean_pooled_1" and "mean_pooled_2" created in ③ with __concatenate()__.
・In this code, "sub" and "mult" must be created and passed to concatenate(), so create each of them first. After reshaping this and creating the output layer "output", create the model with __Model()__. For the input layers, pass "input1" and "input2" created in ①.
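Putting steps ① to ④ together, one possible sketch of the whole model is below. All sizes are hypothetical, and the exact layer arrangement (dropout, Dense width, and the definitions of "sub" as an elementwise difference and "mult" as an elementwise product) is an assumption, since the article's actual code is only shown in screenshots:

```python
from tensorflow.keras.layers import (Activation, AveragePooling1D, Bidirectional,
    Concatenate, Dense, Dot, Embedding, Input, LSTM, Multiply, Reshape, Subtract)
from tensorflow.keras.models import Model

# hypothetical sizes; the real values come from the article's dataset
len_q, len_a = 20, 10
vocab_size, embed_dim, units = 1000, 32, 16

# ① BiLSTM over Question and Answer
input1 = Input(shape=(len_q,))
h1 = Bidirectional(LSTM(units, return_sequences=True))(
    Embedding(vocab_size, embed_dim)(input1))
input2 = Input(shape=(len_a,))
h2 = Bidirectional(LSTM(units, return_sequences=True))(
    Embedding(vocab_size, embed_dim)(input2))

# ② Attention from Question to Answer: dot → softmax → dot → concatenate → Dense
scores = Dot(axes=-1)([h2, h1])                 # (batch, len_a, len_q)
weights = Activation('softmax')(scores)
h1_attended = Dot(axes=(2, 1))([weights, h1])   # (batch, len_a, 2*units)
h = Dense(2 * units, activation='tanh')(
    Concatenate(axis=-1)([h1_attended, h2]))

# ③ mean pooling over time, then flatten with Reshape
mean_pooled_1 = Reshape((2 * units,))(AveragePooling1D(pool_size=len_q)(h1))
mean_pooled_2 = Reshape((2 * units,))(AveragePooling1D(pool_size=len_a)(h))

# ④ combine the two vectors (difference and product, assumed) and classify
sub = Subtract()([mean_pooled_1, mean_pooled_2])
mult = Multiply()([mean_pooled_1, mean_pooled_2])
output = Dense(2, activation='softmax')(Concatenate()([sub, mult]))

model = Model(inputs=[input1, input2], outputs=output)
```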

・Code![Screenshot 2020-10-30 13.11.43.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f2bdd73d-46b6-53e5-6c0b-38e4a160b157.png)

Model learning


・Now that the model has been created, the next step is to __train it__. Training itself is done with __model.fit()__, but before that, the __training data and correct-answer labels__ must be created.
・For the __training data__, put the questions and answers into lists and pass them. First create empty lists "questions" and "answers", store the 'question' of the train data in the former and the value part of 'answerChoices' in the latter, then pass them to the model.
・For the correct-answer labels, append __[1, 0] for the choice that equals the answer and [0, 1] for the others__ to an empty list "outputs", convert it to NumPy format with __np.array()__, and pass it to the model. Whether a choice equals the answer can be checked by whether the key (number) of 'answerChoices' matches 'correctAnswer'.
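The construction of the lists and labels can be sketched with a toy entry. The dict structure below follows the article's description of train.json but is an assumption:

```python
import numpy as np

# hypothetical single entry in the train.json format described in the article
entry = {'question': 'what is earth science',
         'answerChoices': {'a': 'the study of earth', 'b': 'the study of stars'},
         'correctAnswer': 'a'}

questions, answers, outputs = [], [], []
for key, choice in entry['answerChoices'].items():
    questions.append(entry['question'])
    answers.append(choice)
    # [1, 0] if this choice's key matches correctAnswer, [0, 1] otherwise
    outputs.append([1, 0] if key == entry['correctAnswer'] else [0, 1])

outputs = np.array(outputs)  # convert labels to NumPy format for model.fit()
```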

・Code![Screenshot 2020-10-30 13.13.28.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/719c1b8b-3d23-d87b-10fb-50e2c52c6fa0.png)


・Finally, __test the accuracy of the model using the evaluation data__. Since this is __binary classification__, __"accuracy", "precision", and "recall"__ are calculated as evaluation metrics.
・(Review) A prediction falls into one of four categories: __true positive, false positive, false negative, and true negative__. "The proportion of all predictions that were correct" is the __accuracy__, "the proportion of items predicted positive that were actually positive" is the __precision__, and "the proportion of actually positive items that were predicted positive" is the __recall__.
・To calculate these metrics, we must first count the __true positives__ and the other categories. For that, we need both the "predictions" and the "correct answers" of the classification. The predictions are obtained with __model.predict()__, and the correct answers can be taken directly from the outputs created in the previous section. Both are stored as [1, 0] for correct (positive) and [0, 1] for incorrect (negative), so by taking the __argmax along axis=-1__ you can tell positive from negative. Based on this, the four categories such as true positive are counted and the metrics are calculated.
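The metric computation can be sketched in NumPy with toy predictions; the values below are illustrative only, not the article's actual results:

```python
import numpy as np

# toy model outputs and labels in the [1,0]=positive, [0,1]=negative format
pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

pred_pos = np.argmax(pred, axis=-1) == 0  # predicted positive
real_pos = np.argmax(true, axis=-1) == 0  # actually positive

tp = int(np.sum(pred_pos & real_pos))     # true positives
fp = int(np.sum(pred_pos & ~real_pos))    # false positives
fn = int(np.sum(~pred_pos & real_pos))    # false negatives
tn = int(np.sum(~pred_pos & ~real_pos))   # true negatives

accuracy = (tp + tn) / len(true)          # correct predictions over all
precision = tp / (tp + fp)                # correct among predicted positives
recall = tp / (tp + fn)                   # found among actual positives
```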

・Code![Screenshot 2020-10-30 13.15.06.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7c682822-8e73-d2d2-5dbd-e29e3ff4a859.png)

・Result![Screenshot 2020-10-30 13.15.24.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/688beca2-21d0-be64-8a39-13419bb1be8c.png)

Visualization of Attention

・In Attention from sentence s to sentence t, __$a_{ij}$__ indicates how much the j-th word of s attends to the i-th word of t. The matrix A that has $a_{ij}$ as its (i, j) component is called the __Attention Matrix__. Looking at it, you can visualize the relationship between the words of s and t.
・In the figure below, the vertical axis is the answer words and the horizontal axis is the question words. The whiter a cell, the stronger the relationship.
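A minimal sketch of building and plotting such an Attention Matrix is below. The hidden states are random placeholders; in the real code, the matrix comes from the trained model's attention layer:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
h_answer = rng.normal(size=(3, 8))    # placeholder answer hidden states
h_question = rng.normal(size=(4, 8))  # placeholder question hidden states

# Attention Matrix: rows = answer words, columns = question words;
# each row sums to 1 because of the softmax
A = softmax(h_answer @ h_question.T, axis=-1)

# plotting sketch (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.imshow(A, cmap='gray')  # whiter = stronger attention
# plt.xlabel('question words'); plt.ylabel('answer words')
# plt.show()
```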

・ Figure (code below) スクリーンショット 2020-10-30 13.24.25.png

・Code![Screenshot 2020-10-30 13.24.58.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/13506aa3-2328-0435-c88b-bccd51d305b8.png)


Summary

・When passing data to a deep learning model, the data must be preprocessed first. There are four kinds of preprocessing: word segmentation, normalization, ID conversion, and padding.
・The model used this time, "Attention-based QA-LSTM", is built by implementing BiLSTMs, applying Attention, calculating the averages of the hidden states of the Question and of the Answer after Attention, and combining them.
・When training the model, pass the training data (word IDs of the questions and answers) and the teacher labels (whether each choice is the answer). At evaluation time, the accuracy, precision, and recall are calculated.
・By visualizing Attention, the relationship between the two inputs can be seen.

That's all for this time. Thank you for reading to the end.
