TensorFlow 0.12 was released recently. One of its new features is embedding visualization, which makes it possible to analyze high-dimensional data interactively.
The following is a visualization of MNIST. It is a still image here, but on the official site it moves smoothly in 3D, and you can watch it there.
In this article, I try out Embedding Visualization by visualizing Word2vec. First, let's install what we need.
First, install TensorFlow 0.12. Please install it by referring to the following page.
After the installation is complete, we will train a model to visualize.
Clone the repository, then move into the embedding directory by running the following commands:
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow/models/embedding
Run the following command to download the training and evaluation data:
$ wget http://mattmahoney.net/dc/text8.zip -O text8.zip
$ unzip text8.zip
$ wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
$ unzip -p source-archive.zip word2vec/trunk/questions-words.txt > questions-words.txt
$ rm source-archive.zip
Now that we have the data, we will train the word vectors. Run the following command:
$ python word2vec_optimized.py --train_data=text8 --eval_data=questions-words.txt --save_path=/tmp/
Training takes about an hour, so please wait.
Once training is complete, we can visualize the result. First, start TensorBoard by running the following command:
$ tensorboard --logdir=/tmp/
Once it has started, open the displayed address in your browser. Then select the Embedding tab to see the visualized vectors.
By the way, when I visualized Word2vec, the vocabulary was so large that I could not make sense of what I was looking at.
It seems that using metadata will allow you to display the word itself instead of the word ID.
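As a rough sketch of what such a metadata file could look like: the projector reads a TSV file whose lines correspond, in order, to the rows of the embedding matrix, and a single-column file needs no header row. The helper name `write_metadata`, the file name, and the tiny vocabulary below are all made up for illustration. Pointing TensorBoard at the file is a separate step that is not shown here.

```python
import os
import tempfile


def write_metadata(id_to_word, path):
    """Write one word per line, in word-ID order (line i labels row i)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in id_to_word:
            # Tabs and newlines would break the TSV layout, so replace them.
            f.write(word.replace("\t", " ").replace("\n", " ") + "\n")


# Hypothetical vocabulary: the index in the list is the word ID.
vocab = ["UNK", "the", "of", "and"]
metadata_path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
write_metadata(vocab, metadata_path)

# Read the file back to confirm that line order matches word-ID order.
with open(metadata_path, encoding="utf-8") as f:
    labels = f.read().splitlines()
```

The important part is the ordering: if the lines do not follow the same order as the embedding rows, each point would be shown with the wrong word.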
When you select the Embedding tab, nothing may be displayed in the browser, and the following error may appear on the console. In fact, that is what happened to me.
File "/Users/user_name/venv/lib/python3.4/site-packages/tensorflow/tensorboard/plugins/projector/plugin.py", line 139, in configs
run_path_pairs.append(('.', self.logdir))
AttributeError: 'dict_items' object has no attribute 'append'
In that case, change line 139 of tensorflow/tensorboard/plugins/projector/plugin.py in the installed TensorFlow as follows. Then restart TensorBoard.
- run_path_pairs.append(('.', self.logdir))
+ run_path_pairs = [('.', self.logdir)]
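The error message itself comes from a Python 3 behavior: `dict.items()` returns a view object rather than a list, so it has no `append` method. The sketch below reproduces this with a made-up dictionary (`run_paths` and the `/tmp/` path are illustrative stand-ins, not the plugin's actual data) and shows that converting the view to a list first would be another way around it.

```python
# Hypothetical stand-in for a mapping of run names to paths (empty here).
run_paths = {}

pairs = run_paths.items()  # Python 3: a dict_items view, not a list
try:
    pairs.append((".", "/tmp/"))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Converting the view to a list restores list methods such as append.
pairs = list(run_paths.items())
pairs.append((".", "/tmp/"))
```

Under Python 2, `items()` returned a plain list, which is presumably why the original line worked there.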
After selecting a node (a word) and choosing "isolate 101 points", the following was displayed.
This shows the selected word together with the 100 words most similar to it. Similarity can be measured with either cosine similarity or Euclidean distance. You can also increase or decrease the number of words displayed by changing the neighbors setting.
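To make the difference between the two measures concrete, here is a minimal sketch of a nearest-neighbor lookup in plain Python. The function names and the toy 2-D vectors are invented for illustration; this is not how TensorBoard implements the search, only the same idea at small scale.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; near 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, vectors, k, distance):
    """Return the k words closest to `query`, excluding the query itself."""
    others = [w for w in vectors if w != query]
    others.sort(key=lambda w: distance(vectors[query], vectors[w]))
    return others[:k]


# Toy 2-D vectors, made up for illustration (not trained embeddings).
vectors = {
    "king": [1.0, 1.0],
    "queen": [0.9, 1.1],
    "emperor": [3.0, 3.0],
    "apple": [1.0, -1.0],
}

by_cosine = nearest_neighbors("king", vectors, 2, cosine_distance)
by_euclid = nearest_neighbors("king", vectors, 2, euclidean_distance)
```

In this toy data, cosine distance ranks "emperor" first because it points in the same direction as "king" despite being far away, while Euclidean distance ranks "queen" first because it is the closest point. The choice of measure can therefore change which words appear as neighbors.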
You can also choose among several algorithms for the visualization.
The default is PCA, but you can also use t-SNE or CUSTOM. The view is 3D by default, but it can also be displayed in 2D.
It would be even more interesting if words could be assigned as labels. For now, I'll stop at this brief introduction.