Try Embedding Visualization added in TensorFlow 0.12

Introduction

TensorFlow 0.12 was released the other day. One of its new features is embedding visualization, which lets you explore high-dimensional data interactively.

The following is a visualization of MNIST. The image below is a still, but on the official site the embedding rotates smoothly in 3D, so you can see where each cluster sits. embedding_visualization.png

In this article, I try out Embedding Visualization by visualizing Word2vec vectors. First, let's install everything.

Installation

First, install TensorFlow 0.12. Refer to the following page for installation instructions.

After the installation is complete, we will train a model to visualize.

Train the model

First clone the TensorFlow repository, then move into the embedding models directory:

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow/models/embedding

Run the following command to download the training and evaluation data:

$ wget http://mattmahoney.net/dc/text8.zip -O text8.zip
$ unzip text8.zip
$ wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
$ unzip -p source-archive.zip  word2vec/trunk/questions-words.txt > questions-words.txt
$ rm source-archive.zip

Now that we have the data, let's train the word vectors. Run the following command:

$ python word2vec_optimized.py --train_data=text8 --eval_data=questions-words.txt --save_path=/tmp/

Training takes about an hour, so please be patient.

View Embedding in TensorBoard

Once training is complete, you can view the result. First, start TensorBoard by running the following command:

$ tensorboard --logdir=/tmp/

Once it has started, open the address it prints (http://localhost:6006 by default) in your browser, then select the Embeddings tab to see the visualized vectors.

By the way, when I visualized Word2vec this way, the vocabulary was so large that I couldn't make sense of the plot: each point is labeled only with a word ID.

It seems that supplying metadata lets you display the words themselves instead of word IDs.
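As a rough sketch of what such a metadata file looks like: one word per line, ordered by word ID. How you obtain the vocabulary list from the training script is not shown here, and the file name `metadata.tsv` is just a convention.

```python
def write_metadata(words, path):
    """Write one word per line, in word-ID order, as projector metadata."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")

if __name__ == "__main__":
    # Toy vocabulary, assumed to be ordered by word ID
    vocab = ["the", "of", "and"]
    write_metadata(vocab, "metadata.tsv")
```

The Embeddings tab has a field for loading such a metadata file; pointing a `projector_config.pbtxt` at it via a `metadata_path` entry reportedly works as well.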

If nothing is displayed

When you select the Embeddings tab, nothing may be displayed in the browser, and the following error may appear on the console. At least, it did for me.

 File "/Users/user_name/venv/lib/python3.4/site-packages/tensorflow/tensorboard/plugins/projector/plugin.py", line 139, in configs
    run_path_pairs.append(('.', self.logdir))
AttributeError: 'dict_items' object has no attribute 'append'

In that case, edit line 139 of **tensorflow/tensorboard/plugins/projector/plugin.py** in your installed TensorFlow as follows, then restart TensorBoard.

- run_path_pairs.append(('.', self.logdir))
+ run_path_pairs = [('.', self.logdir)]
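The error occurs because `run_path_pairs` holds the result of `dict.items()`, which in Python 3 is a view object with no `append` method; replacing the append with a fresh list sidesteps this. A minimal illustration (the dictionary contents are made up):

```python
runs = {"run1": "/tmp/run1"}
run_path_pairs = runs.items()  # dict_items view, not a list

# dict_items has no append(), which is what triggers the AttributeError
assert not hasattr(run_path_pairs, "append")

# The patched code builds a plain list instead, so appending works
run_path_pairs = [(".", "/tmp/")]
run_path_pairs.append(("run1", "/tmp/run1"))
```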

Play around with it

After selecting a node (word) and choosing "isolate 101 points", the following was displayed.

Screenshot 2016-11-30 17.50.31.png

This displays the 100 words most similar to the selected word. Similarity can be measured with either cosine similarity or Euclidean distance, and you can change how many words are shown via the neighbors setting.
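As a quick sketch of the two measures the projector offers (plain NumPy, nothing TensorFlow-specific):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two points; 0.0 means identical."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])
print(cosine_similarity(a, b))   # 1.0 (same direction, different length)
print(euclidean_distance(a, b))  # 1.0
```

Note that the two can disagree: vectors pointing the same way but with very different lengths are "close" by cosine similarity yet far apart by Euclidean distance.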

You can also choose among several algorithms for the visualization. Screenshot 2016-11-30 18.18.57.png

The default is PCA, but you can also use t-SNE or a custom projection. The view is 3D by default, but 2D display is also possible.
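PCA simply projects the embeddings onto their top principal components; here is a minimal NumPy sketch of the 3-component case (this is the general technique, not the projector's own implementation):

```python
import numpy as np

def pca_project(embeddings, n_components=3):
    """Project row vectors onto their top principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 200))  # e.g. 100 words, 200-dim embeddings
points_3d = pca_project(vectors)
print(points_3d.shape)  # (100, 3)
```

t-SNE, by contrast, is an iterative method that tries to preserve local neighborhoods, which is why the projector animates it as it converges.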

In conclusion

It would be even more interesting with the words themselves shown as labels. For now, though, this quick introduction will have to do.
