Text mining with word2vec etc. in Python ([High School Information Department "Information II"] teaching materials for teacher training)

Introduction

In a previous article, I did simple text mining on a work from Aozora Bunko: https://qiita.com/ereyester/items/7c220a49c15073809c33 This time, I would like to use Word2vec to explore the similarity of words. There are many other articles about Word2vec, such as "Understanding Word2Vec", "[Python] How to use Word2Vec", and "The mechanism of Word2vec explained with pictures", so I won't explain it in detail here. Instead, I would like to focus on the word2vec.Word2Vec() function in gensim.models of gensim.

Teaching materials

[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 3 Information and Data Science, Second Half (PDF: 7.6MB)

Environment

Parts to be taken up in the teaching materials

Learning 18 Text mining and image recognition: "2. Text mining using MeCab"

Implementation example and results in Python

Preparation

In Python, install the gensim package, which provides the Word2vec implementation used for machine learning here.

!pip install gensim

Next, download the sentiment dictionary for the sentiment analysis performed later. Sentiment analysis could also be done by measuring, with Word2vec, the distance of each term from key terms that express emotion, but here, as a Japanese dictionary, the PN Table from Tokyo Institute of Technology is used for the sentiment analysis.

import urllib.request
import pandas as pd

#PN table link
url = 'http://www.lr.pi.titech.ac.jp/~takamura/pubs/pn_ja.dic'

#File save name
file_path = 'pn_ja.dic'

with urllib.request.urlopen(url) as dl_file:
    with open(file_path, 'wb') as out_file:
        out_file.write(dl_file.read())

#Read the dictionary
dic = pd.read_csv(file_path, sep=':', encoding='shift_jis', names=('word', 'reading', 'Info1', 'PN'))

print(dic)

The execution result is as follows.

       word reading Info1        PN
0 Excellent verb 1.000000
1 good good adjective 0.999995
2 Rejoice Joyful verb 0.999979
3 Compliment Compliment Verb 0.999979
4 MedetaiMedetai Adjective 0.999645
...     ...     ...   ...       ...
55120 No No Auxiliary verb-0.999997
55121 Terrible terrible adjectives-0.999997
55122 Illness noun-0.999998
55123 Die no verb-0.999999
55124 Bad bad adjectives-1.000000

[55125 rows x 4 columns]

Next, install MeCab.

(Added at 19:00 on 2020/09/18) The author of mecab-python3 pointed out that it is not necessary to install mecab, libmecab-dev, and ipadic with aptitude before installing mecab-python3, so I have fixed this.

!pip install mecab-python3
!pip install unidic-lite
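As a quick check that MeCab can be called from Python with the dictionary installed above, you can parse a short sentence (a minimal sketch; the sample sentence is just a commonly used test phrase and can be anything).

import MeCab

# Quick check: print the morphological analysis of a short sentence
tagger = MeCab.Tagger()
print(tagger.parse("すもももももももものうち"))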

Model building and text analysis with Word2vec

In order to train word2vec, convert the text to be analyzed into word-separated ("wakachigaki") form and save it. Follow the steps below. (1) Download and read the text data of Natsume Soseki's "Botchan" to use for the text analysis. (2) Remove ruby, annotations, and so on. (3) Extract the nouns, adjectives, and verbs from the text of "Botchan", remove numbers and non-independent words, convert the result to word-separated text, and save it to a file called tf.txt.

from collections import Counter
import MeCab    #Read MeCab
import zipfile
import os.path,glob
import re

#Specify the URL of the zip file of "Botchan" on Aozora Bunko
url = 'https://www.aozora.gr.jp/cards/000148/files/752_ruby_2438.zip'

#Zip file save name
file_path = 'temp.zip'

#Download and extract the zip file of "Botchan", then delete the downloaded zip
with urllib.request.urlopen(url) as dl_file:
    with open(file_path, 'wb') as out_file:
        out_file.write(dl_file.read())
        with zipfile.ZipFile(file_path) as zf:
            listfiles = zf.namelist()
            zf.extractall()

os.remove(file_path)

#Read the file as shift_jis
with open(listfiles[0], 'rb') as f:
    text = f.read().decode('shift_jis')

#Removal of ruby, annotations, etc.
text = re.split(r'\-{5,}', text)[2]    #Keep the body text after the header block
text = re.split(r'底本：', text)[0]     #Drop the colophon and everything after it
text = re.sub(r'《.+?》', '', text)     #Remove ruby
text = re.sub(r'［＃.+?］', '', text)   #Remove editorial annotations
text = text.strip()

#Prepare to use MeCab
tagger = MeCab.Tagger()

#Parse an empty string first; without this initialization an error can occur
tagger.parse("")

#Morphological analysis with MeCab
node = tagger.parseToNode(text)
word_list_raw = []
result_dict_raw = {}
#Extract nouns, adjectives, and verbs (MeCab part-of-speech labels are in Japanese)
wordclass_list = ['名詞', '形容詞', '動詞']
#Exclude numbers, non-independent words, pronouns, and suffixes
not_fine_word_class_list = ['数', '非自立', '代名詞', '接尾']

while node:
    #Split the feature string (part of speech, fine classification, ...)
    word_feature = node.feature.split(",")
    #Get the surface form of the word
    word = node.surface
    #Get part of speech
    word_class = word_feature[0]
    fine_word_class = word_feature[1]
    #Specify what to extract from part of speech and what to exclude
    if ((word not in ['', ' ','\r', '\u3000']) \
        and (word_class in wordclass_list) \
        and (fine_word_class not in not_fine_word_class_list)):
        #word list
        word_list_raw.append(word)
        result_dict_raw[word] = [word_class, fine_word_class]
    #Advance to the next word
    node = node.next
print(word_list_raw)

wakachi_text = ' '.join(word_list_raw)

#wakachi file save name
file2_path = 'tf.txt'

with open(file2_path, 'w') as out_file:
    out_file.write(wakachi_text)

print(wakachi_text)

The execution result is as follows.

['one', 'Handing over', 'Gunless gun', 'Small offering', 'Time', 'loss', 'Shi', 'Is', 'school',…
One parent's hand-held non-gun small service Lost time School is at school Time is jumping downstairs ...
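For reference, since Counter is already imported above, you can also peek at the most frequent extracted words before building the model (a minimal sketch using the word_list_raw list built in the previous step).

# Show the 10 most frequent extracted words
print(Counter(word_list_raw).most_common(10))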

Next is the construction of the model by word2vec, which is the main part of this article.

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentence_data = word2vec.LineSentence('tf.txt')
model_bochan = word2vec.Word2Vec(sentence_data,
                         sg=1,        # Skip-gram (1) rather than CBOW (0)
                         size=100,    # Number of dimensions of the word vectors
                         min_count=5, # Discard words that appear fewer than min_count times
                         window=12,   # Maximum number of context words before/after the target
                         hs=0,        # Hierarchical softmax (0 = use negative sampling)
                         negative=5,  # Number of negative samples
                         iter=10      # Number of epochs
                         )

model_bochan.save('test.model')

Import word2vec from the gensim module and build the model with Word2Vec.

Two learning models are available in word2vec: CBOW (continuous bag-of-words) and skip-gram.

I will omit the explanation of these two, but since skip-gram is used in the teaching material, skip-gram is used here as well. (Of CBOW and skip-gram, skip-gram generally shows better performance.)

The number of dimensions of the word vectors is set to 100, which is the same as the default. Words that appear fewer than 5 times are discarded. The maximum number of words before and after the target word that are treated as context is 12.

There are two algorithms that speed up learning:

- Hierarchical Softmax
- Negative Sampling

I will omit the explanation of these two as well. Negative sampling is used here.

This is explained in detail at http://tkengo.github.io/blog/2016/05/09/understand-how-to-learn-word2vec/.

The number of iterations over the corpus is set to 10. This is the number of epochs, that is, how many times the neural network is trained on the same training data.

This allowed us to build the model.
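As a quick sanity check of the saved model (a minimal sketch; the attribute names follow gensim 3.x, which the parameter names above assume, and 'シャツ' is only an example word assumed to be in the vocabulary), you can look at the vocabulary size and the shape of a single word vector.

model = word2vec.Word2Vec.load('test.model')

# Number of words kept after the min_count filter (gensim 3.x attribute)
print(len(model.wv.vocab))

# Each word is represented by a 100-dimensional vector
print(model.wv['シャツ'].shape)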

Next, let's look at the similarity of words, as in the teaching material. Let's look up words similar to "赤" (red).

model   = word2vec.Word2Vec.load('test.model')
results = model.most_similar(positive=['赤'], topn=100)  # words most similar to '赤' (red)

for result in results:
    print(result[0], '\t', result[1])

The execution result is as follows.

:
2020-09-16 12:30:12,986 : INFO : precomputing L2-norms of word weight vectors
Shirt 0.9854607582092285
Annoying 0.9401918053627014
First name 0.9231084585189819
Gorki 0.9050831198692322
Know 0.8979452252388
Gentle 0.897865891456604
Agree 0.8932155966758728
Russia Aya 0.8931306004524231
Madonna 0.890703558921814
:

As shown above, words related to the character "Red Shirt" (赤シャツ) come to the top. Next, as an example of subtracting elements in the model, let's subtract "シャツ" (shirt) from "マドンナ" (Madonna).

model = word2vec.Word2Vec.load('test.model')

results = model.most_similar(positive=['マドンナ'], negative=['シャツ'], topn=100)  # "Madonna" minus "shirt"
for result in results:
    print(result[0], '\t', result[1])

The execution result is as follows.

:
INFO : precomputing L2-norms of word weight vectors
Voice 0.2074282020330429
Geisha 0.1831434667110443
Dumpling 0.13945674896240234
Entertainment 0.13744047284126282
Tempura 0.11241232603788376
Grasshopper 0.10779635608196259
Teacher 0.08393052220344543
Spirit 0.08120302855968475
Kindness 0.0712042897939682
:

By subtracting the "シャツ" (shirt) element from "マドンナ" (Madonna), we were able to extract elements such as voice, geisha, teacher, and kindness.

The addition and subtraction of elements with word2vec is described in detail at https://www.pc-koubou.jp/magazine/9905.

Simple sentiment analysis with PN table

Sentiment analysis could be done by measuring the distance from key emotion terms with Word2Vec, but here, as in the teaching material, I would like to analyze with the sentiment dictionary (PN Table) loaded above.
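For reference, the distance-based approach mentioned above could be sketched roughly as follows: score a word by its similarity to a positive seed word minus its similarity to a negative seed word. This is only a minimal sketch; the seed words '良い' (good) and '悪い' (bad) are arbitrary choices, and both the seeds and the target word must be in the model's vocabulary.

model = word2vec.Word2Vec.load('test.model')

def simple_pn_score(word, pos_seed='良い', neg_seed='悪い'):
    # Positive if the word is closer to the positive seed than to the negative seed
    if word not in model.wv.vocab:
        return None
    return model.wv.similarity(word, pos_seed) - model.wv.similarity(word, neg_seed)

print(simple_pn_score('シャツ'))

The rest of this section follows the PN Table approach instead.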

First, convert the dictionary from a DataFrame to a dict to make it easier to handle.

dic2 = dic[['word', 'PN']].rename(columns={'word': 'TERM'})

#Convert PN Table from data frame to dict type
word_list = list(dic2['TERM'])
pn_list = list(dic2['PN'])  #The type of contents is numpy.float64

pn_dict = dict(zip(word_list, pn_list))

print(pn_dict)

The execution result is as follows.

{'Excellent': 1.0, 'good': 0.9999950000000001, 'Rejoice': 0.9999790000000001, 'praise': 0.9999790000000001, 'Congratulations': 0.9996450000000001,…

Positive terms have a value close to 1, and negative terms have a value close to -1.
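For example, a single term can be looked up directly (a minimal sketch; get() returns None if the term is not in the dictionary).

# Look up the polarity value of one term
print(pn_dict.get('良い'))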

Next, extract the nouns and adjectives from "Botchan" and remove the numbers, suffixes, and non-independent words. Then, by combining a word-frequency table with the sentiment dictionary, display the positive and negative words.

#Prepare to use MeCab
tagger = MeCab.Tagger()

#Parse an empty string first; without this initialization an error can occur
tagger.parse("")

#Morphological analysis with MeCab
node = tagger.parseToNode(text)
word_list_raw = []
extra_result_list = []
#Extract nouns and adjectives (MeCab part-of-speech labels are in Japanese)
wordclass_list = ['名詞', '形容詞']
#Exclude numbers, suffixes, and non-independent words
not_fine_word_class_list = ['数', '接尾', '非自立']

while node:
    #Split the feature string (part of speech, fine classification, ...)
    word_feature = node.feature.split(",")
    #Get the surface form of the word
    word = node.surface
    #Get part of speech
    word_class = word_feature[0]
    fine_word_class = word_feature[1]
    #Specify what to extract from part of speech and what to exclude
    if ((word not in ['', ' ','\r', '\u3000']) \
        and (word_class in wordclass_list) \
        and (fine_word_class not in not_fine_word_class_list)):
        #word list
        word_list_raw.append(word)
    #Advance to the next word
    node = node.next

freq_counterlist_raw = Counter(word_list_raw)
dict_freq_raw = dict(freq_counterlist_raw)

extra_result_list = []
for k, v in dict_freq_raw.items():
    if k in pn_dict:
        extra_result_list.append([k, v, pn_dict[k]])

extra_result_pn_sorted_list = sorted(extra_result_list, key=lambda x:x[2], reverse=True)
print("Positive words")
display(extra_result_pn_sorted_list[:10])
print("Negative words")
display(extra_result_pn_sorted_list[-10:-1])

The execution result is as follows.

Positive words
[['Congratulations', 1, 0.9996450000000001],
 ['good', 2, 0.9993139999999999],
 ['happy', 1, 0.998871],
 ['Assortment', 1, 0.998208],
 ['Credit', 2, 0.997308],
 ['justice', 1, 0.9972780000000001],
 ['Impressed', 10, 0.997201],
 ['Excuse me', 1, 0.9967889999999999],
 ['Encouragement', 1, 0.9959040000000001],
 ['appropriateness', 1, 0.995553]]
Negative words
[['Rough', 13, -0.9993340000000001],
 ['narrow', 7, -0.999342],
 ['Cold', 1, -0.999383],
 ['punishment', 5, -0.9994299999999999],
 ['enemy', 3, -0.9995790000000001],
 ['painful', 1, -0.9997879999999999],
 ['poor', 6, -0.9998309999999999],
 ['Absent', 338, -0.9999969999999999],
 ['sick', 6, -0.9999979999999999]]

The second element of each entry is the frequency of appearance (count), and the third is the value indicating whether the word is positive or negative. Finally, let's look at whether positive or negative words are used more often in "Botchan" as a whole. (However, following the teaching material, word frequency is not taken into account here.)

pos_n = sum(x[2] > 0 for x in extra_result_pn_sorted_list)

print(pos_n)

neg_n = sum(x[2] < 0 for x in extra_result_pn_sorted_list)

print(neg_n)

The execution result is as follows.

182
1914

This shows that negative words are used more often in "Botchan".
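As an extension not covered in the teaching material, the word frequencies collected above could also be taken into account, for example as a frequency-weighted average of the PN values (a minimal sketch that reuses the extra_result_list built earlier).

# Frequency-weighted average of the PN values over all scored words
total_freq = sum(freq for _, freq, _ in extra_result_list)
weighted_score = sum(freq * pn for _, freq, pn in extra_result_list) / total_freq
print(weighted_score)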

Comments

For the processing related to word2vec, Python and R did not give similar results, so I would like to investigate the cause when I have time in the future.

Source code

https://gist.github.com/ereyester/101ae0da17e747b701b67fe9fe137b84
