[PYTHON] 100 amateur language processing knocks: 90

This is my record of taking on the 2015 edition of the 100 Language Processing Knocks. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). Click here (http://qiita.com/segavvy/items/fb50ba8097d59475f760) for a list of past knocks.

Chapter 10: Vector Space Methods (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

90. Learning with word2vec

Apply word2vec to the corpus created in Problem 81 and learn word vectors. In addition, convert the format of the learned word vectors and run the programs from Problems 86 to 89.

The finished code:

main.py


# coding: utf-8
import pickle
from collections import OrderedDict
import numpy as np
from scipy import io
import word2vec

fname_input = 'corpus81.txt'
fname_word2vec_out = 'vectors.txt'
fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'

#Vectorization with word2vec
word2vec.word2vec(train=fname_input, output=fname_word2vec_out,
	size=300, threads=4, binary=0)

#Read the result and create a matrix and dictionary
with open(fname_word2vec_out, 'rt') as data_file:

	#Get the number of terms and dimensions from the first line
	work = data_file.readline().split(' ')
	size_dict = int(work[0])
	size_x = int(work[1])

	#Dictionary and matrix creation
	dict_index_t = OrderedDict()
	matrix_x = np.zeros([size_dict, size_x], dtype=np.float64)

	for i, line in enumerate(data_file):
		work = line.strip().split(' ')
		dict_index_t[work[0]] = i
		matrix_x[i] = work[1:]

#Export results
io.savemat(fname_matrix_x300, {'matrix_x300': matrix_x})
with open(fname_dict_index_t, 'wb') as data_file:
	pickle.dump(dict_index_t, data_file)

Execution result:

When main.py is executed, it reads "corpus81.txt" created in Problem 81, trains word2vec on it, and outputs "matrix_x300.mat" and "dict_index_t". Processing took about 4 minutes on my machine.

The format of the two output files is the same as in Problem 85, so if you simply swap in these files, the programs from [Problem 86](http://qiita.com/segavvy/items/d0cfabf328fd6d67d003) to Problem 89 can be run as they are. The results are shown below.
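
For reference, before the results, here is a minimal sketch of how those programs read the two files back. It assumes the file names used in main.py above; the actual code for Problems 86 to 89 is in the linked posts, so this is only for illustration.


import pickle
from scipy import io

# Load the word-vector matrix saved by main.py (one 300-dimensional row per term)
matrix_x300 = io.loadmat('matrix_x300')['matrix_x300']

# Load the term -> row index dictionary
with open('dict_index_t', 'rb') as data_file:
	dict_index_t = pickle.load(data_file)

# Problem 86: display the vector for "United_States"
print(matrix_x300[dict_index_t['United_States']])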

Results of Problem 86

It displays the vector for "United_States", so the output is simply 300 numbers in a row.

Execution result of problem 86


[  2.32081000e-01   1.34141400e+00   7.57177000e-01   9.18121000e-01
   1.41462400e+00   4.61902000e-01  -3.19372000e-01  -9.11796000e-01
   6.74263000e-01   8.88596000e-01   8.66489000e-01   4.41949000e-01
  -6.52780000e-02  -5.73398000e-01  -1.72020000e-01   2.79280000e-01
  -1.61161000e-01   4.50549000e-01   7.46780000e-02  -3.13907000e-01
  -4.32671000e-01   6.18620000e-02  -1.27725100e+00   6.85341000e-01
   3.03760000e-02  -3.19811000e-01  -7.68924000e-01  -2.62472000e-01
   4.91034000e-01   9.34251000e-01  -6.05433000e-01  -5.19170000e-02
  -6.72454000e-01   1.55326600e+00  -7.37928000e-01   1.66526200e+00
  -6.69270000e-02   8.88963000e-01  -6.68554000e-01   2.86349000e-01
  -1.27271300e+00  -1.21432000e-01   1.26359000e+00   1.25684600e+00
   1.97781000e-01   8.14802000e-01   2.05766000e-01  -4.26121000e-01
   7.07411000e-01   7.51749000e-01   6.40161000e-01  -3.28497000e-01
   4.20656000e-01   4.26616000e-01  -2.29688000e-01  -4.02054000e-01
  -2.33294000e-01  -6.42150000e-02  -7.11624000e-01   1.82619000e-01
  -7.58055000e-01  -2.03132000e-01   5.12000000e-04   1.31971700e+00
   1.03481400e+00   2.22623000e-01   6.24024000e-01   9.64505000e-01
  -7.62032000e-01  -3.60960000e-02   4.45112000e-01  -5.08120000e-01
  -1.00680500e+00  -2.55381000e-01   8.55365000e-01   6.17396000e-01
  -7.78720000e-01  -6.18505000e-01   1.21397000e-01  -1.69275000e-01
   6.60319000e-01  -3.36548000e-01  -5.62175000e-01  -2.04378300e+00
  -7.94834000e-01  -4.65775000e-01  -7.54679000e-01   3.90806000e-01
  -8.01828000e-01  -4.92555000e-01   3.47642000e-01  -4.28183000e-01
  -1.99666800e+00   1.82001000e-01  -1.70085000e-01   9.28966000e-01
  -1.96638600e+00   9.23961000e-01   4.84498000e-01  -5.24912000e-01
   1.02234000e+00   4.62904000e-01   4.10672000e-01   6.97174000e-01
   6.19435000e-01   8.32230000e-02   1.41234000e-01   6.12439000e-01
  -1.45182000e+00   1.85729000e-01   5.67926000e-01  -3.29128000e-01
  -3.83217000e-01   3.79447000e-01  -5.50135000e-01  -4.12838000e-01
  -4.16418000e-01   1.05820000e-02   6.92200000e-02  -6.27480000e-02
   1.24219800e+00  -3.96815000e-01  -4.01746000e-01  -6.71752000e-01
   7.81617000e-01  -8.54749000e-01  -1.07806700e+00   7.44280000e-02
  -1.91329200e+00  -1.21407300e+00  -5.23873000e-01  -1.01673500e+00
   4.35801000e-01   1.73546700e+00  -7.54100000e-01  -5.14167000e-01
  -2.15539000e-01  -6.96321000e-01   1.45136000e-01   6.40906000e-01
  -4.21082000e-01  -3.60932000e-01  -2.98236100e+00   1.05500300e+00
  -5.42376000e-01   2.06387000e-01   2.28400000e-02  -1.87433000e-01
  -4.26448000e-01  -7.00808000e-01  -1.91694000e-01  -6.80270000e-02
   8.37304000e-01   6.18913000e-01   3.09183000e-01  -2.22531000e-01
  -3.08164000e-01   1.91496000e+00  -2.05698000e-01  -1.38298000e+00
   1.08415000e-01   5.35886000e-01  -2.32130000e-02   6.94406000e-01
  -4.17144000e-01  -1.90199000e+00   6.69315000e-01  -6.32312000e-01
  -3.45570000e-02  -6.03989000e-01   3.56266000e-01  -1.02690000e+00
   4.67688000e-01   5.27140000e-02   3.66741000e-01   1.92638600e+00
   6.22386000e-01   4.83680000e-01   1.00020800e+00   4.46445000e-01
   4.13120000e-01   2.12195000e-01   1.56286000e-01   1.33522500e+00
   6.97672000e-01   5.66884000e-01   1.53622000e-01   6.39750000e-01
  -2.03707000e-01   2.10565800e+00  -1.17320000e-01   8.55233000e-01
   2.61317700e+00  -2.14519000e-01   8.55025000e-01   9.06171000e-01
  -4.56919000e-01  -1.40941000e-01  -6.24079000e-01  -1.26463800e+00
  -9.31688000e-01   9.94177000e-01  -6.76021000e-01  -9.58533000e-01
   4.40553000e-01  -1.23600000e-03  -5.81909000e-01   3.57520000e-01
  -7.99588000e-01   1.11611700e+00  -4.93985000e-01   1.23746500e+00
  -7.51088000e-01  -9.28216000e-01   3.05621000e-01  -5.11757000e-01
   1.05883000e-01   4.88388000e-01   8.31103000e-01  -5.05967000e-01
  -1.01836400e+00  -2.54270000e-01  -4.25978000e-01   2.21318000e-01
  -7.14479000e-01   3.37610000e-01  -6.56314000e-01  -3.55550000e-01
   2.31042000e-01  -9.86197000e-01  -7.63255000e-01   1.04544800e+00
   1.57370400e+00   1.95025900e+00   5.00542000e-01  -5.48677000e-01
   5.21174000e-01  -2.04218000e-01  -2.11823000e-01  -2.30830000e-01
   1.45851700e+00  -2.69244000e-01  -8.57567000e-01   1.28116000e+00
   1.18514300e+00   7.82615000e-01  -7.24170000e-02  -1.07394300e+00
  -5.76223000e-01   5.17903000e-01   6.55052000e-01   1.56492300e+00
   1.58710000e-01   1.64205300e+00   4.27021000e-01   1.65960000e-01
   1.27899000e-01   2.45154000e-01  -3.33136000e-01   3.69693000e-01
   6.90610000e-01  -5.47800000e-01   1.87585000e-01   6.63304000e-01
  -1.18193300e+00  -3.42415000e-01  -1.97505000e-01   1.55585000e+00
   6.80237000e-01   7.02119000e-01  -1.39572100e+00  -2.07230000e-02
  -4.62809000e-01  -4.94772000e-01   2.25839000e-01   3.32944000e-01
  -7.71918000e-01  -8.55043000e-01  -5.98472000e-01  -1.60165800e+00
  -3.56646000e-01  -3.89552000e-01  -7.58449000e-01   2.03913000e-01
   2.84149000e-01  -5.72755000e-01  -4.92234000e-01  -1.15743600e+00
  -5.41931000e-01  -7.22312000e-01   8.08674000e-01  -3.62800000e-02
   2.92228000e-01   4.90371000e-01   5.50050000e-01   1.82185000e-01
  -2.12689000e-01  -1.03393500e+00   1.97234000e-01  -2.13381000e-01]
Results of Problem 87

This computes the cosine similarity between "United States" and "U.S.", and the value comes out fairly high.

Execution result of problem 87


0.858448973235
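
For reference, a minimal sketch of the cosine similarity calculation, reusing matrix_x300 and dict_index_t from the loading sketch above. It mirrors what the Problem 87 program does rather than reproducing its exact code, and the exact token for "U.S." depends on the Problem 80/81 preprocessing.


import numpy as np

def cos_sim(vec_a, vec_b):
	# cosine similarity = dot product divided by the product of the norms
	return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

print(cos_sim(matrix_x300[dict_index_t['United_States']],
	matrix_x300[dict_index_t['U.S']]))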
Results of Problem 88

The problem is to find words similar to "England". Unlike the result of Problem 88 in the previous chapter, the constituent countries of the United Kingdom come first here. word2vec is impressive.

Execution result of problem 88


Wales	0.7539543550055905
Scotland	0.7386559299178808
Britain	0.6479338653237635
Ireland	0.6348035977770026
Sweden	0.6046247805709913
Spain	0.6012807753931683
Germany	0.5945993118023707
England.	0.5886246671101062
Norway	0.5712078065200615
London	0.5622154447245881
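
For reference, a minimal sketch of the ranking itself, reusing cos_sim and the objects from the sketches above (not the exact Problem 88 code):


# Rank every other term by cosine similarity to "England" and show the top 10
vec_england = matrix_x300[dict_index_t['England']]
results = [(term, cos_sim(vec_england, matrix_x300[i]))
	for term, i in dict_index_t.items() if term != 'England']
results.sort(key=lambda x: x[1], reverse=True)
for term, sim in results[:10]:
	print('{}\t{}'.format(term, sim))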
Results of Problem 89

This is the problem of guessing the country whose capital is Athens. In Problem 89 of the previous chapter, the correct answer "Greece" came 8th with a cosine similarity of 0.75, which was a little low, but with word2vec it does much better: 3rd place with 0.81.

Execution result of problem 89


Spain	0.8975386269080241
Austria	0.8165995526197494
Greece	0.8115120679668039
Egypt	0.8108041287727046
Italy	0.7967845991447613
Russia	0.7903349902284371
Denmark	0.784935131008747
Sweden	0.7731913094622944
Germany	0.7689020148989952
Portugal	0.7634638759682534
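
For reference, a minimal sketch of the Problem 89 analogy, again reusing the objects from the earlier sketches (not the exact Problem 89 code):


# vec("Spain") - vec("Madrid") + vec("Athens"), then rank all terms by cosine similarity
vec_target = (matrix_x300[dict_index_t['Spain']]
	- matrix_x300[dict_index_t['Madrid']]
	+ matrix_x300[dict_index_t['Athens']])
results = [(term, cos_sim(vec_target, matrix_x300[i]))
	for term, i in dict_index_t.items()]
results.sort(key=lambda x: x[1], reverse=True)
for term, sim in results[:10]:
	print('{}\t{}'.format(term, sim))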

What is word2vec

"Word2vec" is an open source implementation for vectorizing words. It does what we've done in Chapter 9, but it's a big feature that it uses neural networks for dimensional compression. For details, you can find a lot of information by google with "word2vec", so I will omit it here. I studied at O'Reilly's Natural Language Processing with word2vec, but it was quite easy to understand.

Install word2vec

There seem to be several word2vec implementations usable from Python, but this time I used the word2vec wrapper library found on pip. It can be installed with pip install word2vec.

python


segavvy@ubuntu:~$ pip search word2vec
brocas-lm (1.0)                  - Broca's LM is a free python library
                                   providing a probabilistic language model
                                   based on a Recurrent Neural Network (RNN)
                                   with Long Short-Term Memory (LSTM). It
                                   utilizes Gensim's Word2Vec implementation
                                   to transform input word sequences into a
                                   dense vector space. The output of the model
                                   is a seqeuence of probability distributions
                                   across the given vocabulary.
word2vec-wikification-py (0.16)  - A package to run wikification
sense2vec (0.6.0)                - word2vec with NLP-specific tokens
ShallowLearn (0.0.5)             - A collection of supervised learning models
                                   based on shallow neural network approaches
                                   (e.g., word2vec and fastText) with some
                                   additional exclusive features
theano-word2vec (0.2.1)          - word2vec using Theano and Lasagne
word2vec (0.9.1)                 - Wrapper for Google word2vec
word2veckeras (0.0.5.2)          - word2vec based on Kearas and gensim
segavvy@ubuntu:~$ pip install word2vec
Collecting word2vec
  Downloading word2vec-0.9.1.tar.gz (49kB)
    100% |████████████████████████████████| 51kB 1.9MB/s 
Requirement already satisfied: numpy in ./anaconda3/lib/python3.5/site-packages (from word2vec)
Requirement already satisfied: cython in ./anaconda3/lib/python3.5/site-packages (from word2vec)
Building wheels for collected packages: word2vec
  Running setup.py bdist_wheel for word2vec ... done
  Stored in directory: /home/segavvy/.cache/pip/wheels/f9/fa/6a/4cdbfefd2835490548505e4136b8f41f063d8f3c4639bf0f53
Successfully built word2vec
Installing collected packages: word2vec
Successfully installed word2vec-0.9.1

If you can "import word2vec" with this, the installation is completed.

segavvy@ubuntu:~$ python
Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import word2vec
>>> 

How to use word2vec

If you run help(word2vec.word2vec) in the Python interpreter, the parameter descriptions are displayed. They are in English, but they appear to be essentially the same as the command-line options of word2vec itself.

Result of help(word2vec.word2vec) in the Python interpreter


Help on function word2vec in module word2vec.scripts_interface:

word2vec(train, output, size=100, window=5, sample='1e-3', hs=0, negative=5, threads=12, iter_=5, min_count=5, alpha=0.025, debug=2, binary=1, cbow=1, save_vocab=None, read_vocab=None, verbose=False)
    word2vec execution
    
    Parameters for training:
        train <file>
            Use text data from <file> to train the model
        output <file>
            Use <file> to save the resulting word vectors / word clusters
        size <int>
            Set size of word vectors; default is 100
        window <int>
            Set max skip length between words; default is 5
        sample <float>
            Set threshold for occurrence of words. Those that appear with
            higher frequency in the training data will be randomly
            down-sampled; default is 0 (off), useful value is 1e-5
        hs <int>
            Use Hierarchical Softmax; default is 1 (0 = not used)
        negative <int>
            Number of negative examples; default is 0, common values are 5 - 10
            (0 = not used)
        threads <int>
            Use <int> threads (default 1)
        min_count <int>
            This will discard words that appear less than <int> times; default
            is 5
        alpha <float>
            Set the starting learning rate; default is 0.025
        debug <int>
            Set the debug mode (default = 2 = more info during training)
        binary <int>
            Save the resulting vectors in binary moded; default is 0 (off)
        cbow <int>
            Use the continuous back of words model; default is 1 (skip-gram
            model)
        save_vocab <file>
            The vocabulary will be saved to <file>
        read_vocab <file>
            The vocabulary will be read from <file>, not constructed from the
            training data

Leaving the details aside, "train" specifies the input file name, "output" the file name for the resulting vectors, "size" the number of dimensions, and "threads" the number of threads used for processing. The number of dimensions is set to 300 to match the previous problems. Also, since the word2vec result has to be converted to the same format as in Problem 85 afterwards, I specified binary=0 so that the result is written as text and can be read easily.

Parsing the word2vec result file

In word2vec's text-format result file, the first line contains the number of terms and the number of dimensions (here 300), separated by a space. Each subsequent line holds one term followed by the 300 values of its vector, all separated by spaces.

If you can read C, you can confirm this in the word2vec source itself: the loop that outputs the value of each dimension prints "value + space", so an extra blank ends up at the end of every line. Be careful when reading the file.

Source excerpt of the result output part of word2vec.c


    // Save the word vectors
    fprintf(fo, "%lld %lld\n", vocab_size, layer1_size);
    for (a = 0; a < vocab_size; a++) {
      fprintf(fo, "%s ", vocab[a].word);
      if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo);
      else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]);
      fprintf(fo, "\n");
    }
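
To illustrate the effect of that trailing blank, here is a made-up line (real lines come from vectors.txt and have 300 values):


line = 'United_States 0.232081 1.341414 0.757177 \n'
print(line.split(' '))          # ['United_States', '0.232081', '1.341414', '0.757177', '\n'] <- stray last element
print(line.strip().split(' '))  # ['United_States', '0.232081', '1.341414', '0.757177'] <- what main.py does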

In this program, I first have word2vec create the result file "vectors.txt", then read it back and convert it into "matrix_x300.mat" and "dict_index_t".

word2vec produces a smaller result file

I specified the same number of dimensions for the same data, yet word2vec's result file is smaller. This is because its removal of low-frequency words is at work. It can be adjusted with "min_count", but a smaller result file also makes the subsequent processing faster, so this time I left it at the default (terms that appear fewer than 5 times are removed).

In addition, word2vec also has a function that randomly removes high-frequency words. This can be adjusted with "sample".
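
If you want to experiment with these two settings, a minimal sketch, keeping the other arguments from main.py (the values below are only examples, not the ones I used):


word2vec.word2vec(train=fname_input, output=fname_word2vec_out,
	size=300, threads=4, binary=0,
	min_count=2,     # keep terms that appear at least 2 times (default: 5)
	sample='1e-5')   # down-sample high-frequency terms more aggressively (default: '1e-3')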

That's all for the 90th knock. If you find any mistakes, I would appreciate it if you could point them out.

