A memo on the process of creating word embeddings (an embedding matrix) for natural language processing with the GloVe algorithm.
I wrote this because word embeddings are needed to build a generative chatbot. Transfer learning with pretrained embeddings is generally the practical choice, but this time I built the embedding matrix myself to deepen my understanding of how it enables "analogies" by representing the relationships between words as vectors in a linear space.
Data preprocessing
Training
GloVe paper and explanation page (Stanford University)
Reference GloVe implementation in Python (GitHub)
A Wikipedia data dump is used as the Japanese text data; the total size is about 13 GB. The raw dump still contains XML tags and other markup, so WikiExtractor is used to extract only the text needed for training.
The morphological analysis tool MeCab is used to split the Japanese text into meaningful units (words).
As a first trial run, I used only about 100 MB of the files produced by WikiExtractor.
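As a minimal sketch of this preprocessing step, assuming the mecab-python3 binding and placeholder file names (the input being the text produced by WikiExtractor):

import MeCab

# '-Owakati' makes MeCab output the tokens of each sentence separated by spaces.
tagger = MeCab.Tagger('-Owakati')

with open('wiki_extracted.txt') as fin, open('preprocessed.txt', 'w') as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        # parse() returns the space-separated tokens followed by a newline.
        fout.write(tagger.parse(line))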
- Number of lines: 254,103
- Vocabulary: 218,438 words
Each line contains a sequence of sentences. When I passed the whole 100 MB of preprocessed data at once, I hit the RAM limit and the Colab session crashed.
Colab provides about 12 GB of RAM, and the sharp rise in memory usage came from the matrices that hold word-to-word co-occurrence counts, which are used mainly in the three places covered below.
A dense vocabulary × vocabulary array (218,438 × 218,438) is simply too large to handle.
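A quick back-of-the-envelope calculation (my own check, not from the original memo) shows why a dense array is hopeless on roughly 12 GB of RAM:

vocab_size = 218_438
bytes_per_float64 = 8

# A dense vocab x vocab co-occurrence matrix in float64:
dense_bytes = vocab_size * vocab_size * bytes_per_float64
print(f'{dense_bytes / 1024**3:.0f} GiB')  # roughly 355 GiB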
Training was therefore done by splitting the roughly 250,000 lines of text into chunks. After training on each chunk, only the parameters are saved; the matrix holding the co-occurrence counts is reset (re-initialized) rather than carried over. This keeps memory usage within the available limit while still training the parameters across the whole dataset.
GloVe's advantage is that it learns from corpus-wide statistics such as word co-occurrence counts, so splitting the corpus means those statistics are not exploited to the fullest, and the accuracy on word analogy tasks is presumably reduced.
The training data is divided per run, but the parameters and the vocabulary dictionary must be shared across the whole dataset, so they are generated first.
build_vocab.py
from collections import Counter

with open('preprocessed.txt') as f:
    preprocessed_text = f.readlines()

# Lowercase every line (relevant for the Latin-alphabet tokens mixed into the Japanese text).
corpus = []
for line in preprocessed_text:
    line_lower = line.lower()
    corpus.append(line_lower)

# Count how many times each word appears in the corpus.
vocab = Counter()
for line in corpus:
    tokens = line.strip().split()
    vocab.update(tokens)

# word -> (id, frequency)
vocab = {word: (i, freq) for i, (word, freq) in enumerate(vocab.items())}
# id -> word
id2word = dict((i, word) for word, (i, _) in vocab.items())
Counter() is a container type that counts elements as they are added; here it produces a dictionary holding the number of times each word appears.
From that, a vocabulary dictionary that assigns an ID to each word (word → (id, frequency)) and an inverse dictionary with key and value swapped (id → word) are created.
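As a toy illustration (made-up words, not the actual Wikipedia data), the two dictionaries come out like this:

from collections import Counter

toy_corpus = ['the cat sat', 'the dog sat']
toy_vocab = Counter()
for line in toy_corpus:
    toy_vocab.update(line.split())

toy_vocab = {word: (i, freq) for i, (word, freq) in enumerate(toy_vocab.items())}
toy_id2word = {i: word for word, (i, _) in toy_vocab.items()}

print(toy_vocab)    # {'the': (0, 2), 'cat': (1, 1), 'sat': (2, 2), 'dog': (3, 1)}
print(toy_id2word)  # {0: 'the', 1: 'cat', 2: 'sat', 3: 'dog'}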
split_corpus.py
import math

# Split the corpus into 25 chunks; this example takes the second chunk,
# starting 1,000 lines before the chunk boundary.
split_size = math.floor(len(corpus) / 25)
start = split_size - 1000
end = split_size * 2
split_corpus = corpus[start:end]

print('\nLength of split_corpus: ', len(split_corpus))
# About 11,000 lines
The text used for training has about 250,000 lines, so it is divided into 25 chunks, a convenient round number that can be processed without crashing.
When each chunk is trained separately, the co-occurrence counts near a chunk boundary come out lower than they should be, so the start index is moved 1,000 lines before the boundary to keep the text around each break contiguous. A generalized version of this splitting is sketched below.
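The snippet above only computes the second chunk. Generalizing to all 25 chunks with the same 1,000-line overlap might look roughly like this (my sketch, not the original code):

import math

n_splits = 25
overlap = 1000
split_size = math.floor(len(corpus) / n_splits)

split_corpora = []
for k in range(n_splits):
    # Start 1,000 lines before the boundary (except for the first chunk)
    # so that co-occurrences spanning the boundary are not lost entirely.
    start = max(0, k * split_size - overlap)
    end = (k + 1) * split_size if k < n_splits - 1 else len(corpus)
    split_corpora.append(corpus[start:end])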
build_cooccur.py
import logging

import numpy as np
from scipy import sparse

logger = logging.getLogger(__name__)

window_size = 10
min_count = None
vocab_size = len(vocab)

# Sparse matrix in LIL format: only nonzero co-occurrence values are stored.
cooccurrences = sparse.lil_matrix((vocab_size, vocab_size), dtype=np.float64)

for i, line in enumerate(split_corpus):
    if i % 1000 == 0:
        logger.info('Building cooccurrence matrix: on line %i', i)
    tokens = line.strip().split()
    token_ids = [vocab[word][0] for word in tokens]

    for center_i, center_id in enumerate(token_ids):
        # Context words within window_size positions to the left of the center word.
        context_ids = token_ids[max(0, center_i - window_size):center_i]
        contexts_len = len(context_ids)

        for left_i, left_id in enumerate(context_ids):
            # Weight each co-occurrence by 1 / distance between the two words.
            distance = contexts_len - left_i
            increment = 1.0 / float(distance)
            # The matrix is symmetric, so update both directions.
            cooccurrences[center_id, left_id] += increment
            cooccurrences[left_id, center_id] += increment


def iter_cooccurrences():
    # Yield (center_id, context_id, cooccurrence) triples,
    # skipping words whose frequency is below min_count.
    for i, (row, data) in enumerate(zip(cooccurrences.rows, cooccurrences.data)):
        if i % 50000 == 0:
            logger.info('yield cooccurrence matrix: on line %i', i)
        if min_count is not None and vocab[id2word[i]][1] < min_count:
            continue
        for data_idx, j in enumerate(row):
            if min_count is not None and vocab[id2word[j]][1] < min_count:
                continue
            yield i, j, data[data_idx]


# In the reference implementation this generator is decorated with
# functools.wraps so that the yielded triples are returned as a list;
# here they are simply collected into a plain list for training.
cooccurrence_triples = list(iter_cooccurrences())
To save memory, the co-occurrence matrix is created with scipy's sparse.lil_matrix, which stores only the values of nonzero elements. Creating the matrix as a dense NumPy array already puts enough pressure on memory to cause a crash.
Even with sparse.lil_matrix(), the number of nonzero elements grows so large that memory pressure crashed the session at around 30,000 lines of text data.
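One way to watch this happen (an added check, not part of the original memo) is to log how many nonzero elements the sparse matrix holds right after the matrix-building loop, before the triples are extracted:

# Number of explicitly stored elements; each one costs noticeably more than
# 8 bytes in LIL format, so a rapidly growing count is an early warning of
# memory pressure.
print('nonzero elements:', cooccurrences.nnz)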
Part of GloVe's cost function: f(X_ij) * (theta_i^T e_j + b_i + b_j - log(X_ij))^2, summed over all co-occurring word pairs (see the paper for details).
Here i and j are indices into the vocabulary; the weight vectors and biases come in symmetric pairs (center word and context word), and theta_i and e_j are combined with a dot product (inner product).
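As my own note (not from the original memo): if the cost is tracked as 0.5 * f(X_ij) * cost_inner^2 with cost_inner = theta_i^T e_j + b_i + b_j - log(X_ij), then its partial derivatives are f(X_ij) * cost_inner * e_j with respect to theta_i, f(X_ij) * cost_inner * theta_i with respect to e_j, and f(X_ij) * cost_inner with respect to each bias. These correspond directly to grad_target, grad_context, and grad_bias_* in train_glove.py below.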
train_glove.py
from math import log
from random import shuffle

import numpy as np
import h5py

# Parameter initialization.
# Done only on the first run; afterwards the parameters are relayed
# by saving to and loading from HDF5.
# W = (np.random.rand(vocab_size * 2, vector_size) - 0.5) / float(vector_size + 1)
# biases = (np.random.rand(vocab_size * 2) - 0.5) / float(vector_size + 1)
# gradient_squared = np.ones((vocab_size * 2, vector_size), dtype=np.float64)
# gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

with h5py.File('glove_weight_relay.h5', 'r') as f:
    W = f['glove']['weights'][...]
    biases = f['glove']['biases'][...]
    gradient_squared = f['glove']['gradient_squared'][...]
    gradient_squared_biases = f['glove']['gradient_squared_biases'][...]

# The first half of W holds the center-word vectors, the second half the context vectors.
vocab_size = int(len(W) / 2)

# Bundle each co-occurrence triple with views of the corresponding parameters.
data = [(W[i_target],
         W[i_context + vocab_size],
         biases[i_target : i_target + 1],
         biases[i_context + vocab_size : i_context + vocab_size + 1],
         gradient_squared[i_target],
         gradient_squared[i_context + vocab_size],
         gradient_squared_biases[i_target : i_target + 1],
         gradient_squared_biases[i_context + vocab_size : i_context + vocab_size + 1],
         cooccurrence)
        for i_target, i_context, cooccurrence in cooccurrence_triples]

iterations = 7
learning_rate = 0.05
x_max = 100
alpha = 0.75

for i in range(iterations):
    shuffle(data)
    for (v_target, v_context, b_target, b_context,
         gradsq_W_target, gradsq_W_context,
         gradsq_b_target, gradsq_b_context, cooccurrence) in data:
        # The f(X_ij) weighting term.
        weight = (cooccurrence / x_max) ** alpha if cooccurrence < x_max else 1

        # Dot product of the word vectors, plus the biases, minus log of the count.
        cost_inner = (v_target.dot(v_context)
                      + b_target[0] + b_context[0]
                      - log(cooccurrence))
        cost = weight * (cost_inner ** 2)
        # Used to track the cost per iteration:
        # global_cost += 0.5 * cost

        # Partial derivatives.
        grad_target = weight * cost_inner * v_context
        grad_context = weight * cost_inner * v_target
        grad_bias_target = weight * cost_inner
        grad_bias_context = weight * cost_inner

        # Gradient updates = the learning step (AdaGrad style).
        v_target -= (learning_rate * grad_target / np.sqrt(gradsq_W_target))
        v_context -= (learning_rate * grad_context / np.sqrt(gradsq_W_context))
        b_target -= (learning_rate * grad_bias_target / np.sqrt(gradsq_b_target))
        b_context -= (learning_rate * grad_bias_context / np.sqrt(gradsq_b_context))

        # Accumulate squared gradients, used as the divisor at the next update.
        gradsq_W_target += np.square(grad_target)
        gradsq_W_context += np.square(grad_context)
        gradsq_b_target += grad_bias_target ** 2
        gradsq_b_context += grad_bias_context ** 2

# Save the parameters.
with h5py.File('glove_weight_relay.h5', 'a') as f:
    f['glove']['weights'][...] = W
    f['glove']['biases'][...] = biases
    f['glove']['gradient_squared'][...] = gradient_squared
    f['glove']['gradient_squared_biases'][...] = gradient_squared_biases
Parameters such as the weights and biases are shared across the entire training run, so they are initialized only on the first run; after that they are repeatedly saved and reloaded between chunks.
The number of training iterations is set to 7. More iterations give better parameters, but each pass takes a long time and this is only a trial build, so I kept the number low.
The gradient update uses the adaptive gradient method (AdaGrad) adopted in GloVe: each update is divided by the square root of the accumulated squared gradients, i.e. parameter -= learning_rate * grad / sqrt(sum of past grad^2). It is a close relative of stochastic gradient descent, with a per-parameter learning rate.
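For reference, the one-time initialization on the very first run has to create the HDF5 file with the parameter arrays. A minimal sketch, assuming the same group/dataset layout that the loading code expects (the embedding dimension vector_size is my placeholder; the memo does not state the value used):

import numpy as np
import h5py

vector_size = 100          # placeholder embedding dimension (not stated in the memo)
vocab_size = len(vocab)    # from build_vocab.py

W = (np.random.rand(vocab_size * 2, vector_size) - 0.5) / float(vector_size + 1)
biases = (np.random.rand(vocab_size * 2) - 0.5) / float(vector_size + 1)
gradient_squared = np.ones((vocab_size * 2, vector_size), dtype=np.float64)
gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

with h5py.File('glove_weight_relay.h5', 'w') as f:
    group = f.create_group('glove')
    group.create_dataset('weights', data=W)
    group.create_dataset('biases', data=biases)
    group.create_dataset('gradient_squared', data=gradient_squared)
    group.create_dataset('gradient_squared_biases', data=gradient_squared_biases)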
test_similarity.py
import numpy as np

# Merge each symmetric pair of vectors (center and context) by averaging.
merge_fun = lambda m, c: np.mean([m, c], axis=0)
normalize = True

vocab_size = int(len(W) / 2)
for i, row in enumerate(W[:vocab_size]):
    merged = merge_fun(row, W[i + vocab_size])
    if normalize:
        merged /= np.linalg.norm(merged)
    W[i, :] = merged
merge_W = W[:vocab_size]

# Find the 15 words whose vectors are most similar to the query word.
word = 'word'
n = 15
word_id = vocab[word][0]

dists = np.dot(merge_W, merge_W[word_id])
top_ids = np.argsort(dists)[::-1][:n + 1]
similar = [id2word[id] for id in top_ids if id != word_id][:n]
numpy.argsort() returns the indices that would sort the array in ascending order; indexing the result with [::-1] reverses it to get descending order.
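A tiny concrete example of that ascending/descending behaviour:

import numpy as np

dists = np.array([0.2, 0.9, 0.5])
print(np.argsort(dists))        # [0 2 1]  (ascending)
print(np.argsort(dists)[::-1])  # [1 2 0]  (descending)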
These are the 15 words whose vectors are closest to the query word: a list of words that appear to be used in similar ways, both semantically and syntactically.
When I looked up words with vectors similar to "America", the list consisted mostly of other country names.
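As a final check of the "analogy" property mentioned at the beginning, the merged matrix can also be queried with vector arithmetic. This is a rough sketch using the variables from test_similarity.py; the example words are placeholders and must exist in the vocabulary:

# Analogy query: a - b + c ~ ?  (e.g. 'king' - 'man' + 'woman')
def analogy(a, b, c, n=5):
    query = merge_W[vocab[a][0]] - merge_W[vocab[b][0]] + merge_W[vocab[c][0]]
    query /= np.linalg.norm(query)
    dists = np.dot(merge_W, query)
    top_ids = np.argsort(dists)[::-1]
    # Exclude the query words themselves from the results.
    exclude = {vocab[a][0], vocab[b][0], vocab[c][0]}
    return [id2word[i] for i in top_ids if i not in exclude][:n]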