[PYTHON] TensorFlow Tutorial-Vector Representation of Words (Translation)

This is a translation of the TensorFlow tutorial Vector Representations of Words (https://www.tensorflow.org/versions/master/tutorials/word2vec/index.html). We would appreciate it if you pointed out any translation errors.


In this tutorial we look at the word2vec model of Mikolov et al. This model is used to learn vector representations of words, called "word embeddings".

Highlights

This tutorial is meant to highlight the interesting, substantive parts of building a word2vec model in TensorFlow.

We cover the code later in the tutorial, but if you prefer to dive straight in, see the minimal implementation in tensorflow/examples/tutorials/word2vec/word2vec_basic.py. This basic example contains the code needed to download some data, train briefly on it, and visualize the results. Once you are comfortable reading and running the basic version, you can move on to [tensorflow/models/embedding/word2vec.py](https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/models/embedding/word2vec.py). This is a heavier-weight implementation that demonstrates some more advanced TensorFlow principles, such as how to use threads to move data efficiently into a text model and how to create checkpoints during training.

But first, let's look at why we would want to learn word embeddings in the first place. If you are an embedding expert, feel free to skip this section and dive into the details.

Motivation: Why Learn Word Embeddings?

Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of, for example, the individual raw pixel intensities of image data or the power spectral density coefficients of audio data. For tasks like object recognition or speech recognition, all the information required to successfully perform the task is encoded in the data (since humans can perform these tasks from the raw data). However, natural language processing systems traditionally treat words as discrete atomic symbols, such as "cat" as "Id537" and "dog" as "Id143". These encodings are arbitrary and provide no useful information to the system about the relationships that may exist between the individual symbols. In other words, very little of what the model learns about "cats" can be used when processing data about "dogs" (for example, that both are animals, four-legged, pets, and so on). In addition, representing words with unique, discrete IDs leads to data sparsity, which usually means that more data is needed to successfully train statistical models. Vector representations can overcome some of these obstacles.

(Figure)

Vector space models (https://en.wikipedia.org/wiki/Vector_space_model) (VSMs) represent (embed) words in a continuous vector space in which semantically similar words are mapped to nearby points ("embedded close to each other"). VSMs have a long, rich history in NLP, but all methods depend in some way on the Distributional Hypothesis (https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis), which states that words that appear in the same contexts share semantic meaning. Approaches that leverage this principle fall into two categories: count-based methods (e.g. Latent Semantic Analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis)) and predictive methods (e.g. neural probabilistic language models).

This distinction is elaborated in much more detail by Baroni et al. In a nutshell: count-based methods compute statistics of how often a word co-occurs with its neighbor words in a large text corpus, and then map these count statistics down to a small, dense vector for each word. Predictive models instead try to predict a word directly from its neighbors using learned small, dense embedding vectors (which are considered parameters of the model).

Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the skip-gram model. Algorithmically these models are similar, except that CBOW predicts the target word (e.g. "mat") from the source context words ("cat sits"), while skip-gram does the inverse and predicts the source context words from the target word. This inversion might seem like an arbitrary choice, but statistically CBOW has the effect of smoothing over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be useful for smaller datasets. Skip-gram, on the other hand, treats each context/target word pair as a new observation, and this tends to do better for larger datasets. The rest of this tutorial focuses on the skip-gram model.

Scaling up with Noise-Contrastive Training

Neural probabilistic language models are traditionally trained using the Maximum Likelihood (https://en.wikipedia.org/wiki/Maximum_likelihood) (ML) principle to maximize the probability of the next word $w_t$ (the "target") given the previously seen words $h$ (the "history"), in terms of a softmax function (https://en.wikipedia.org/wiki/Softmax_function):

P(w_t | h) = \text{softmax}(\text{score}(w_t, h)) \\
           = \frac{\exp\{\text{score}(w_t, h)\}}
             {\sum_\text{Word w' in Vocab} \exp \{ \text{score}(w', h) \} }.

Here, $\text{score}(w_t, h)$ computes the compatibility of the word $w_t$ with the context $h$ (a dot product is commonly used). We train this model by maximizing its log-likelihood on the training set, i.e. by maximizing the following:

 J_\text{ML} = \log P(w_t | h) \\
  = \text{score}(w_t, h) -
     \log \left( \sum_\text{Word w' in Vocab} \exp \{ \text{score}(w', h) \} \right)

This yields a properly normalized probabilistic model for language modeling. However, it is very expensive, because at every training step each probability must be computed and normalized using the scores of all the other $V$ words $w'$ in the vocabulary, in the current context $h$.
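To make this cost concrete, here is a rough NumPy sketch (not part of the original tutorial; the array sizes are illustrative assumptions) of what normalizing the softmax involves: a score for every one of the $V$ vocabulary words must be computed at every single training step.

import numpy as np

# Illustrative sizes only (assumptions, not values from the tutorial).
vocabulary_size, embedding_size = 50000, 128
context_vector = np.random.randn(embedding_size)                  # representation of the history h
output_weights = np.random.randn(vocabulary_size, embedding_size)

# One score per vocabulary word: O(V) work, repeated at every training step.
scores = output_weights.dot(context_vector)
# Normalization also touches all V scores (stabilized with the max trick).
probs = np.exp(scores - scores.max()) / np.sum(np.exp(scores - scores.max()))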

(Figure)

For feature learning with word2vec, on the other hand, a full probabilistic model is not needed. The CBOW and skip-gram models are instead trained using a binary classification objective (logistic regression) to discriminate the real target word $w_t$ from $k$ imaginary (noise) words $\tilde w$ in the same context. This is illustrated below for a CBOW model; for skip-gram the direction is simply inverted.

(Figure)

Mathematically, the objective (for each example) is to maximize:

J_\text{NEG} = \log Q_\theta(D=1 |w_t, h) +
  k \mathop{\mathbb{E}}_{\tilde w \sim P_\text{noise}}
     \left[ \log Q_\theta(D = 0 |\tilde w, h) \right]

where $Q_\theta(D=1 | w, h)$ is the binary logistic regression probability, under the model, of seeing the word $w$ in the context $h$ in the dataset $D$, calculated in terms of the learned embedding vectors $\theta$. In practice we approximate the expectation by drawing $k$ contrastive words from the noise distribution (i.e. we compute a Monte Carlo average).

This objective is maximized when the model assigns high probabilities to the real words and low probabilities to the noise words. Technically, this is called Negative Sampling, and there is good mathematical motivation for using this loss function: the updates it proposes approximate the updates of the softmax function in the limit. It is computationally especially appealing because computing the loss scales only with the number of noise words we select ($k$), and not with all the words in the vocabulary ($V$). This makes training much faster. We will actually make use of the very similar noise-contrastive estimation (NCE) loss, for which TensorFlow has the handy helper function tf.nn.nce_loss().
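As an illustration (not part of the original tutorial), here is a minimal NumPy sketch of the per-example objective $J_\text{NEG}$, assuming the score is a dot product and $Q_\theta(D=1|w,h)$ is a logistic function of that score; the vectors below are random stand-ins for learned embeddings.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

embedding_size = 128
context_vec = np.random.randn(embedding_size)       # stand-in for the context h
target_vec = np.random.randn(embedding_size)        # stand-in for the true target w_t
noise_vecs = np.random.randn(5, embedding_size)     # k = 5 noise words drawn from P_noise

# log Q(D=1 | w_t, h) for the real word, plus the Monte Carlo estimate of
# k * E[log Q(D=0 | w~, h)] as a sum over the k sampled noise words.
j_neg = (np.log(sigmoid(target_vec.dot(context_vec)))
         + np.sum(np.log(sigmoid(-noise_vecs.dot(context_vec)))))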

Now let's see how this actually works!

The Skip-gram Model

As an example, consider the following dataset.

the quick brown fox jumped over the lazy dog

First, we form a dataset of words and the contexts in which they appear. We could define "context" in any way that makes sense; in fact, people have looked at syntactic contexts (i.e. the syntactic dependents of the current target word; see [Levy et al.](https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf)), words to the left of the target, words to the right of the target, and so on. For now, let's stick to a simple definition and define "context" as the window of words to the left and right of the target word. Using a window size of 1, we get the following dataset of (context, target) pairs.

([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...

Recall that skip-gram inverts contexts and targets and tries to predict each context word from its target word, so the task becomes predicting "the" and "brown" from "quick", "quick" and "fox" from "brown", and so on. The dataset therefore becomes the following (input, output) pairs.

(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
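These pairs can be produced with a few lines of Python; the following sketch is illustrative only and is not the tutorial's generate_batch function.

sentence = "the quick brown fox jumped over the lazy dog".split()
window = 1

pairs = []
for i, target in enumerate(sentence):
    # Every word within `window` positions of the target is a context word.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[1:5])  # [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]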

The objective function is defined over the entire dataset, but we typically optimize it with stochastic gradient descent (https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD) using one example at a time (or a "minibatch" of batch_size examples, where typically 16 <= batch_size <= 512). Let's look at one step of this process.

Imagine that at training step $t$ we observe the first training case above, where the goal is to predict "the" from "quick". We select num_noise noise (contrastive) examples by drawing from some noise distribution, typically the unigram distribution $P(w)$. For simplicity, let's say num_noise = 1 and we select "sheep" as the noisy example. Next we compute the loss for this pair of observed and noisy examples, i.e. the objective at step $t$ becomes:

J^{(t)}_\text{NEG} = \log Q_\theta(D=1 | \text{the, quick}) +
  \log(Q_\theta(D=0 | \text{sheep, quick}))

The goal is to update the embedding parameters $\theta$ to improve (in this case, maximize) this objective function. We do this by deriving the gradient of the loss with respect to the embedding parameters $\theta$, i.e. $\frac{\partial}{\partial \theta} J_\text{NEG}$ (luckily, TensorFlow provides easy helper functions for doing this). We then update the embeddings by taking a small step in the direction of the gradient. Repeating this process over the entire training set has the effect of "moving" the embedding vectors around for each word until the model is successful at discriminating real words from noise words.

We can visualize the learned vectors by projecting them down to two dimensions using, for example, the t-SNE dimensionality reduction technique. Inspecting these visualizations reveals that the vectors capture general, and in practice very useful, semantic information about words and their relationships to one another. It was very interesting when it was first discovered that certain directions in the induced vector space specialize towards particular semantic relationships, for example male-female and country-capital relationships between words, as shown in the figure below (see also Mikolov et al., 2013).

(Figure)

This explains why these vectors are also useful as features in many standard NLP prediction tasks, such as part-of-speech tagging and named entity recognition (see the original work by Collobert et al., 2011 (https://arxiv.org/abs/1103.0398) and the follow-up by Turian et al., 2010 (http://www.aclweb.org/anthology/P10-1040)).

But for now, let's just use them to draw pretty pictures!

Graph construction

That's all there is to say about embeddings for now. Let's define our embedding matrix. This is just a big random matrix to start with; we initialize its values to be uniformly random between -1 and 1.

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

The noise-contrastive estimation loss is defined in terms of a logistic regression model. For this, we need to define the weights and biases for each word in the vocabulary (also called the "output weights", as opposed to the "input embeddings"). Let's define them now.

nce_weights = tf.Variable(
  tf.truncated_normal([vocabulary_size, embedding_size],
                      stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

Now that the parameters are defined, we can define the graph for the skip-gram model. For simplicity, let's suppose each word in the vocabulary is represented as an integer and the text corpus has already been integerized (for details, see word2vec_basic.py: https://tensorflow.googlesource.com/tensorflow/+/master/tensorflow/g3doc/tutorials/word2vec/word2vec_basic.py). The skip-gram model takes two inputs: one is a batch of integers representing the source context words, the other is for the target words. Let's create placeholder nodes for these inputs, so that we can feed in data later.

# Placeholders for inputs
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
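As an aside, the "integerization" assumed above might look something like the following minimal sketch (a hypothetical helper, loosely following the UNK-at-index-0 convention of word2vec_basic.py and reusing the vocabulary_size from the snippets above; it is not the tutorial's preprocessing code).

import collections

words = "the quick brown fox jumped over the lazy dog".split()  # toy corpus
counts = collections.Counter(words).most_common(vocabulary_size - 1)
dictionary = {"UNK": 0}                               # out-of-vocabulary words map to ID 0
for word, _ in counts:
    dictionary[word] = len(dictionary)
data = [dictionary.get(word, 0) for word in words]    # the corpus as integer IDs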

What we need to do here is to look up the vector for each source word in the batch. TensorFlow has a handy helper to make this easy.

embed = tf.nn.embedding_lookup(embeddings, train_inputs)

OK, now that we have the embeddings for each word, let's try to predict the target word using the noise-contrastive training objective.

# Compute the NCE loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
  tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
                 num_sampled, vocabulary_size))

Now that we have a loss node, we need to add the nodes required to compute gradients and update the parameters. For this we use stochastic gradient descent, and again TensorFlow has handy helpers to make this easy.

# We use the SGD optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

Model training

Training the model is then as simple as using a feed_dict to push data into the placeholders and calling session.run with this new data in a loop.

for inputs, labels in generate_batch(...):
  feed_dict = {train_inputs: inputs, train_labels: labels}
  _, cur_loss = session.run([optimizer, loss], feed_dict=feed_dict)

See tensorflow/g3doc/tutorials/word2vec/word2vec_basic.py for a complete code example.
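The generate_batch helper itself is not shown in this tutorial; a simplified, hypothetical sketch of what it might look like is given below (the real implementation in word2vec_basic.py differs in its parameters and sampling details).

import random
import numpy as np

def generate_batch(data, batch_size, num_batches, window=1):
    """Yield (inputs, labels) arrays of skip-gram pairs from an integerized corpus."""
    for _ in range(num_batches):
        inputs = np.zeros(batch_size, dtype=np.int32)
        labels = np.zeros((batch_size, 1), dtype=np.int32)
        for k in range(batch_size):
            i = random.randint(window, len(data) - window - 1)    # position of a target word
            offsets = [o for o in range(-window, window + 1) if o != 0]
            inputs[k] = data[i]                                   # target word ID
            labels[k, 0] = data[i + random.choice(offsets)]       # one of its context word IDs
        yield inputs, labels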

Visualizing the Trained Embeddings

After training has finished, we can visualize the trained embeddings using t-SNE.
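A minimal visualization sketch is shown below. It assumes a trained matrix final_embeddings of shape [vocabulary_size, embedding_size] and a reverse_dictionary mapping integer IDs back to words; neither name appears in the snippets above, and the full plotting code lives in word2vec_basic.py.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

num_points = 500                                        # plot only the most frequent words
tsne = TSNE(n_components=2, perplexity=30, init='pca')
low_dim = tsne.fit_transform(final_embeddings[:num_points, :])

plt.figure(figsize=(15, 15))
for i in range(num_points):
    x, y = low_dim[i, :]
    plt.scatter(x, y)
    plt.annotate(reverse_dictionary[i], xy=(x, y))
plt.show()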

(Figure)

And we're done! As expected, words that are similar end up clustering near each other. For a heavier-weight implementation of word2vec that showcases more of the advanced features of TensorFlow, see the implementation in tensorflow/models/embedding/word2vec.py.

Evaluating Embeddings: Analogical Reasoning

Embeddings are useful for a wide variety of prediction tasks in NLP. Short of training a full-blown part-of-speech or named-entity model, one simple way to evaluate embeddings is to directly use them to predict syntactic and semantic relationships, like "king is to queen as father is to ?". This is called analogical reasoning; the task was introduced by Mikolov et al., and the dataset can be downloaded here: https://word2vec.googlecode.com/svn/trunk/questions-words.txt
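The prediction itself boils down to simple vector arithmetic plus a nearest-neighbour search. The sketch below is illustrative only (it is not the evaluation code in word2vec.py) and assumes row-normalized final_embeddings together with the dictionary / reverse_dictionary mappings from the earlier sketches.

import numpy as np

def analogy(a, b, c, final_embeddings, dictionary, reverse_dictionary, topn=1):
    """Answer 'a is to b as c is to ?' by cosine similarity over unit-length rows."""
    query = (final_embeddings[dictionary[b]]
             - final_embeddings[dictionary[a]]
             + final_embeddings[dictionary[c]])
    query /= np.linalg.norm(query)
    sims = final_embeddings.dot(query)                  # cosine similarities
    ranked = np.argsort(-sims)
    exclude = {dictionary[a], dictionary[b], dictionary[c]}
    return [reverse_dictionary[i] for i in ranked if i not in exclude][:topn]

# e.g. analogy('king', 'queen', 'father', ...) should ideally return ['mother']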

To see how this evaluation is done, have a look at the build_eval_graph() and eval() functions in tensorflow/models/embedding/word2vec.py.

The choice of hyperparameters strongly influences accuracy on this task. To achieve state-of-the-art performance, you need to train over a very large dataset, carefully tune the hyperparameters, and make use of tricks like subsampling the data, which is beyond the scope of this tutorial.

Optimizing the Implementation

This simple implementation showcases the flexibility of TensorFlow. For example, changing the training objective is as easy as swapping out the call to tf.nn.nce_loss() for an off-the-shelf alternative such as tf.nn.sampled_softmax_loss(). If you have a new idea for a loss function, you can manually write an expression for the new objective in TensorFlow and let the optimizer compute its derivatives. This flexibility is invaluable in the exploratory phase of machine learning model development, when you want to try out several different ideas and iterate quickly.
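For instance, the swap mentioned above might look like the following (a sketch, assuming tf.nn.sampled_softmax_loss accepts the same argument order as the tf.nn.nce_loss call used earlier in this tutorial).

# Swap the NCE loss for sampled softmax; everything else stays the same.
loss = tf.reduce_mean(
  tf.nn.sampled_softmax_loss(nce_weights, nce_biases, embed, train_labels,
                             num_sampled, vocabulary_size))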

Once you have a model structure you're satisfied with, it may be worth optimizing your implementation to run more efficiently (and to cover more data in less time). For example, the naive code in this tutorial would suffer reduced speed because it uses Python to read and feed data items, each of which requires very little work on the TensorFlow back-end. If you find your model seriously bottlenecked on input data, you can implement a custom data reader for your problem, as described in New Data Formats (https://www.tensorflow.org/versions/master/how_tos/new_data_formats/index.html). For the skip-gram model, this has already been done as an example in tensorflow/models/embedding/word2vec.py.

If your model is no longer I/O bound and you still need more performance, you can go further by writing your own TensorFlow operations, as described in Adding a New Op (https://www.tensorflow.org/versions/master/how_tos/adding_an_op/index.html). Again, tensorflow/models/embedding/word2vec_optimized.py is provided as an example of this for the skip-gram case. Benchmark these against each other to measure the performance improvement at each stage.

Conclusion

This tutorial covered the word2vec model, a computationally efficient model for learning word embeddings. We motivated why embeddings are useful, discussed efficient training techniques, and showed how to implement all of this in TensorFlow. Overall, we hope this has showcased how TensorFlow provides the flexibility needed for early experimentation and the control needed for later bespoke optimized implementations.
