[PYTHON] "Learning word2vec" and "Visualization with Tensorboard" on Colaboratory

Introduction

In this article, we will train word2vec on Colaboratory and visualize the result with TensorBoard.

- The TensorBoard output will be **published on the Internet**, so please use only open data. (If anyone knows how to run TensorBoard's PROJECTOR without publishing it, please let me know.)
- word2vec and TensorBoard themselves are not explained here, so please study them separately:
  - Word2Vec: The amazing power of word vectors that surprised even the inventor
  - Thorough introduction to TensorBoard to visualize all data

Data used / What we will do

To train word2vec, we use Natsume Soseki's novel "I Am a Cat" ("Wagahai wa Neko de Aru"), which is freely available from Aozora Bunko because its copyright has expired.

By training word2vec on the words in the novel, we will verify whether the computer can correctly recognize that **"I" (the narrator) is a "cat"**. (If it is recognized correctly, the word vector for "I" should be close to the word vector for "cat".)
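
Here, "close" means a small angle between the two vectors. As a minimal sketch of the similarity measure involved (generic NumPy, not tied to any of the libraries used below):


import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity of two word vectors; 1.0 means the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))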

Output

[Image: TensorBoard PROJECTOR view of the learned word vectors]

Implementation

From here, we will implement it using Google Colaboratory.

[Step 1] Library installation / import

Install the necessary libraries on Colaboratory. We use the following two.

- MeCab (+ the mecab-ipadic-neologd dictionary)
  - MeCab is a free morphological analyzer used to split sentences into words.
  - The mecab-ipadic-neologd dictionary lets it correctly recognize proper nouns such as those found on Wikipedia.
  - (Reference) [I examined the effect of "mecab-ipadic-NEologd" which is strong against new words and named entities](https://engineering.linecorp.com/ja/blog/mecab-ipadic-neologd-new-words-and-expressions/)

Installing MeCab (+ mecab-ipadic-neologd)


!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
!pip install mecab-python3 > /dev/null
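
To confirm the installation worked, here is a minimal check that tokenizes a short phrase with the neologd dictionary (the dictionary path is the one used later in this article; it may differ in other environments):


import MeCab

# Point MeCab at the installed neologd dictionary
tagger = MeCab.Tagger("-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")
print(tagger.parse("吾輩は猫である"))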

Installing tensorboardX


!pip install tensorboardX
%load_ext tensorboard
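
Incidentally, the extension loaded above provides the %tensorboard magic, which can display TensorBoard inline once ./runs has been written in Step 5. Note, however, that the PROJECTOR tab reportedly may not load in the inline view, which is why this article publishes it via ngrok instead:


%tensorboard --logdir ./runs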

Import the installed and standard libraries.

Library import


import re
import MeCab
import torch
from gensim.models import word2vec
from tensorboardX import SummaryWriter
from itertools import chain

[Step 2] Download data

Download the zip file of "I Am a Cat" from the Aozora Bunko site onto Colaboratory and unzip it.

Download and unzip the "I am a cat" zip file


!wget https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip
!unzip 789_ruby_5639.zip

Then, "wagahaiwa_nekodearu.txt" will appear, so read the file.

Data reading


# The Aozora Bunko text file is encoded in Shift-JIS
with open('./wagahaiwa_nekodearu.txt', 'r', encoding='shift-jis') as f:
    texts = [t.strip() for t in f.readlines()]

Let's print the contents of the file.

Output data


texts
['I am a cat',
 'Natsume Soseki',
 '',
 '-------------------------------------------------------',
 '[About the symbols that appear in this text]',
 '',
 '《》: ruby (furigana reading)',
 '(Example) 吾輩《わがはい》',
 '',
 '|: marks the start of the string to which the ruby applies',
 '(Example) 一番|獰悪《どうあく》',
 '',
 '[#]: inputter note, mainly explanations of gaiji and designations of emphasis marks',
 '(numbers are JIS X 0213 plane-row-cell code points or Unicode, plus base-text page and line numbers)',
 '(Example) ※[#「言+墟のつくり」、第4水準2-88-74]',
 '',
 '〔〕: encloses accent-decomposed Western text',
 '(Example) 〔Quid aliud est mulier nisi amicitiae& inimica〕',
 'Please see the following URL for details on accent decomposition',
 'http://www.aozora.gr.jp/accent_separation.html',
 '-------------------------------------------------------',
 '',
 '[# 8 indentation] 1 [# "1" is the middle heading]',
 '',
 'I am a cat. There is no name yet.',
 'I have no idea where I was born. I remember only crying in a dim and damp place. I saw human beings for the first time here. Moreover, I heard later that it was the most evil race of human beings called Shosei. This student is a story that sometimes catches us, simmers them, and eats them. However, I didn't think anything at that time, so I didn't think it was particularly scary. However, when it was placed on his palm and lifted up, it just felt fluffy. It is probably the beginning of what is called a human being that calms down a little on the palm and sees the student's face. The feeling that I thought was strange at this time still remains. The face, which should be decorated with the first hair, is slippery and looks like a kettle. After that, I met a lot of cats, but I have never met such a one-wheeled cat. Not only that, the center of the face is too protruding. Then, from the inside of the hole, I sometimes blow smoke. Apparently my throat was so weak that I was really weak. It was around this time that I finally learned that this is a cigarette that humans drink.',
 'I sat in a good mood for a while behind the palm of this student, but after a while I started driving at a very high speed. I don't know if the student will move or only I will move, but my eyes turn to the darkness. I feel sick. When I thought that it wouldn't help at all, I heard a loud noise and a fire broke out in my eyes. Until then, I remember it, but I don't know what to do or how much I try to come up with.',
 'When I suddenly noticed, there was no student. There are many brothers, and I can't even see Piki. Even the mother of Kanjin, who is important, has disappeared. On top of that, it's bright and dark, unlike the places I've been up to now. I can't even open my eyes. If Hatena's Yoko is strange, it hurts very much when I take it out. I was suddenly abandoned into Sasahara from the top of the straw.',
・ ・ ・]

Looking at the output results, we can see the following.

  1. Explanatory text (metadata) is included before and after the body of the novel.
  2. As the legend explains, ruby annotations and inputter notes appear throughout the text.
  3. A single list element can contain multiple sentences.

We will perform these preprocessing in the next step.

[Step 3] Data preprocessing

Here, we will prepare a function for preprocessing sentences.

Function for sentence preprocessing


def preprocessTexts(texts):
  # 1. Remove the explanatory text before and after the novel body
  texts = texts[23:-17]

  # 2. Remove ruby, ruby start markers, inputter notes, accent-decomposed text, and full-width spaces
  #    (the Aozora Bunko markers 《》, |, [#], 〔〕 are full-width characters)
  signs = re.compile(r'(《.*?》)|(|)|([#.*?])|(〔.*?〕)|(\u3000)')
  texts = [signs.sub('', t) for t in texts]

  # 3. Split each element into individual sentences at '。'
  texts = [t.split('。') for t in texts]
  texts = list(chain.from_iterable(texts))

  # Drop strings of one character or less (they are not sentences)
  texts = [t for t in texts if len(t) > 1]

  return texts

Preprocessing


texts = preprocessTexts(texts)
print('Number of sentences:', len(texts))
Number of sentences: 9058

Preprocessing has given us a list of sentences that word2vec can learn from.
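
As a quick sanity check, the sketch below applies the same cleanup regex to a fabricated Aozora-style line (the sample string is made up for illustration):


import re

# Same pattern as in preprocessTexts (full-width Aozora markers)
signs = re.compile(r'(《.*?》)|(|)|([#.*?])|(〔.*?〕)|(\u3000)')

sample = '一番|獰悪《どうあく》な種族であった[#「獰悪」に傍点]'
print(signs.sub('', sample))  # -> '一番獰悪な種族であった'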

[Step 4] Learning word2vec

It is finally time to train word2vec. Before training, each sentence must be tokenized (split into words). We use the MeCab we installed to tokenize each sentence.

First, define a tokenization function.

Tokenization function


def getWords(sentence, tokenizer, obj_pos=['all'], symbol=False):
    """
    Tokenize a sentence into words.

    Parameters
    ----------
    sentence : str
        Sentence to be tokenized
    tokenizer : class
        MeCab tokenizer
    obj_pos : list of str, default ['all']
        Parts of speech to keep (MeCab POS names, e.g. '名詞' for nouns)
    symbol : bool, default False
        Whether to include symbols ('記号')

    Returns
    --------
    words : list of str
        The words of the sentence
    """

    node = tokenizer.parseToNode(sentence)
    words = []

    while node:
        results = node.feature.split(",")
        pos = results[0]   # part of speech (MeCab outputs Japanese names such as '名詞')
        word = results[6]  # base (dictionary) form
        if pos != "BOS/EOS" and (pos in obj_pos or 'all' in obj_pos) and (pos != '記号' or symbol):
            if word == '*':
                word = node.surface  # fall back to the surface form when no base form exists
            words.append(word)
        node = node.next
    return words

Use the function to tokenize the sentences. We prepare two sets of tokenization results:

  1. All parts of speech kept ⇒ used for training word2vec
  2. Nouns only ⇒ used for TensorBoard (to keep the visualization simple)

Tokenizing the sentences


# Set up the tokenizer with the neologd dictionary
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
tokenizer = MeCab.Tagger(path)

# Tokenize the sentences (all parts of speech, symbols included)
words = [getWords(t, tokenizer, symbol=True) for t in texts]

# Extract nouns only ('名詞' is MeCab's POS name for nouns)
nouns = [getWords(t, tokenizer, obj_pos=['名詞']) for t in texts]
nouns = set(chain.from_iterable(nouns))  # the set of nouns that appear
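
As a quick check, you can peek at the first tokenized sentence and the size of the noun set (a minimal sketch; the exact output depends on the dictionary version):


print(words[0])    # tokens of the first sentence
print(len(nouns))  # number of distinct nouns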

Now we train word2vec. Set the parameters as follows.

| Parameter | Value | Description |
|:---|:---|:---|
| size | 300 | Dimensionality of the word vectors |
| sg | 1 | Training algorithm (skip-gram: 1, CBOW: 0) |
| min_count | 2 | Ignore words that appear fewer than min_count times |
| seed | 0 | Random seed |

We use skip-gram, which is generally said to be more accurate, and fix the seed value to ensure reproducibility. size and min_count are rules of thumb. The other parameters are left at their defaults.

Parameter setting


size = 300
sg = 1
min_count = 2
seed = 0

Next, train the model. To make the visualization easier to interpret, the learned vectors are normalized to unit L2 norm after training.

Learning word2vec


model = word2vec.Word2Vec(words, size=size, min_count=min_count, sg=sg, seed=seed)
model.init_sims(replace=True)  # normalize the vectors to unit L2 norm
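
Before moving on to visualization, you can sanity-check the model directly with gensim (a minimal sketch; it assumes that '吾輩' ("I") and '猫' ("cat") survived the min_count cutoff):


# Cosine similarity between "I" (吾輩) and "cat" (猫)
print(model.wv.similarity('吾輩', '猫'))

# The five words most similar to "cat"
print(model.wv.most_similar('猫', topn=5))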

This completes the training of word2vec. To keep the visualization simple, we narrow the vocabulary down to the 500 most frequent nouns.

Extracting the learning results


# Get the word vectors and the word list (ordered by frequency)
word_vectors = model.wv.vectors
index2word = model.wv.index2word

# Indices of the words that are nouns
nouns_id = [i for i, n in enumerate(index2word) if n in nouns]

# Keep the top 500 nouns (index2word is sorted by frequency)
word_vectors = word_vectors[nouns_id][:500]
index2word = [index2word[i] for i in nouns_id][:500]
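
A quick check of the result (a minimal sketch; whether exactly 500 nouns remain depends on the tokenization):


print(word_vectors.shape)  # expected: (500, 300)
print(len(index2word))     # expected: 500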

[Step 5] Visualization with TensorBoard

Finally, we visualize the trained word2vec model with TensorBoard. First, output the files for visualization; the tensorboardX library makes this easy.

Output files for TensorBoard


writer = SummaryWriter('./runs')
writer.add_embedding(torch.FloatTensor(word_vectors), metadata=index2word)  # vectors plus word labels
writer.close()
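
If you are curious what was written, you can list the generated files (the exact layout depends on the tensorboardX version):


!find ./runs -type f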

Now serve the output. You can expose TensorBoard running on Colaboratory by using ngrok.

Run TensorBoard


LOG_DIR = './runs'
# Start TensorBoard in the background
get_ipython().system_raw(
    'tensorboard --logdir={} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR)
)
# Download ngrok and tunnel port 6006 to a public URL
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 6006 &')
# Print the public URL of the tunnel
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

Running the code above prints a URL like "http://XXXXXXXX.ngrok.io"; access it and you can see TensorBoard. That's it!

Check the output result

When you access it, it looks like the following. Wait a moment, or switch the dropdown in the upper right from **INACTIVE** to **PROJECTOR**.

[Image: TensorBoard top page with the PROJECTOR dropdown]

Then a 3D view of the word vectors, reduced by PCA, will appear.

[Image: PCA view in the PROJECTOR]

Next, switch from PCA to t-SNE in the panel on the left. The points cluster further, and similar words are grouped together. This is only a screenshot, but it is genuinely fun to watch the optimization run in real time.

[Image: t-SNE view in the PROJECTOR]

Once the t-SNE run has converged, let's check the main question: the similarity (distance) between "I" and "cat". Type "I" into the search box on the right and locate the point for "I".

[Image: searching for "I" in the PROJECTOR]

There is "I" in the lower left and "Cat" in the upper right. The distance is not very similar to 0.317, but it turns out to be reasonably similar. It is thought that this happened because "I" and "cat" rarely appear in the same context.

Looking at each word's nearest neighbors: the words closest to "I" are personal pronouns such as "he" and "they", while the word closest to "cat" is "human", followed by a cluster of other animals. Judging from these neighbors, the model seems to have learned well.

Summary

In this article, we trained word2vec on Colaboratory and visualized it with TensorBoard. Being able to do this so easily, without building a local environment, is very convenient. It would be even better if PROJECTOR could be displayed on Colaboratory without publishing it on the Internet; I hope that becomes possible.

Detailed explanations have been omitted, so please see the reference articles for the terms and libraries.

Reference articles

- [Data visualization] Run TensorBoard Projector with Keras and Colaboratory
- [Until using mecab-ipadic-NEologd with Google Colaboratory](https://shunyaueta.com/posts/2018-04-23_google-colaboratory-%E3%81%A7-mecabipadicneologd-%E3%82%92%E4%BD%BF%E3%81%86%E3%81%BE%E3%81%A7/)
- Gensim word2vec option list
- ngrok is too convenient
