A memo on the process of creating word embeddings (an embedding matrix) for natural language processing with the GloVe algorithm.
I wrote this because word embeddings are needed to build a generative chatbot. Transfer learning with pretrained embeddings is generally the practical choice, but this time I built the embedding matrix myself to deepen my understanding of how it enables "analogies" by representing the relationships between words as vectors in a linear space.
Data preprocessing
Training
GloVe paper and explanation page (Stanford University)
Reference GloVe implementation in Python (GitHub)
A Wikipedia data dump is used as the Japanese text data; the total size is about 13 GB. The raw dump still contains XML tags and other markup, so WikiExtractor is used to extract only the text needed for training.
The morphological analysis tool MeCab is used to split the Japanese text into meaningful units (words).
As a first trial run, I used only about 100 MB of the files produced by WikiExtractor.
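As a minimal sketch of this preprocessing step, assuming the mecab-python3 binding and placeholder file names (the input being the text produced by WikiExtractor):

import MeCab

# '-Owakati' makes MeCab output the tokens of each sentence separated by spaces.
tagger = MeCab.Tagger('-Owakati')

with open('wiki_extracted.txt') as fin, open('preprocessed.txt', 'w') as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        # parse() returns the space-separated tokens followed by a newline.
        fout.write(tagger.parse(line))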
- Number of lines: 254,103
- Vocabulary: 218,438 words
Each line contains a sequence of sentences. When I passed the whole 100 MB of preprocessed data at once, I hit the RAM limit and the Colab session crashed.
Colab provides about 12 GB of RAM, and the sharp rise in memory usage came from the matrices that hold word-to-word co-occurrence counts, which are used mainly in the three places covered below.
A dense vocabulary × vocabulary array (218,438 × 218,438) is simply too large to handle.
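A quick back-of-the-envelope calculation (my own check, not from the original memo) shows why a dense array is hopeless on roughly 12 GB of RAM:

vocab_size = 218_438
bytes_per_float64 = 8

# A dense vocab x vocab co-occurrence matrix in float64:
dense_bytes = vocab_size * vocab_size * bytes_per_float64
print(f'{dense_bytes / 1024**3:.0f} GiB')  # roughly 355 GiB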
Training was therefore done by splitting the roughly 250,000 lines of text into chunks. After training on each chunk, only the parameters are saved; the matrix holding the co-occurrence counts is reset (re-initialized) rather than carried over. This keeps memory usage within the available limit while still training the parameters across the whole dataset.
GloVe's advantage is that it learns from corpus-wide statistics such as word co-occurrence counts, so splitting the corpus means those statistics are not exploited to the fullest, and the accuracy on word analogy tasks is presumably reduced.
The training data is divided per run, but the parameters and the vocabulary dictionary must be shared across the whole dataset, so they are generated first.
build_vocab.py
from collections import Counter

with open('preprocessed.txt') as f:
    preprocessed_text = f.readlines()

# Lowercase every line (relevant for the Latin-alphabet tokens mixed into the Japanese text).
corpus = []
for line in preprocessed_text:
    line_lower = line.lower()
    corpus.append(line_lower)

# Count how many times each word appears in the corpus.
vocab = Counter()
for line in corpus:
    tokens = line.strip().split()
    vocab.update(tokens)

# word -> (id, frequency)
vocab = {word: (i, freq) for i, (word, freq) in enumerate(vocab.items())}
# id -> word
id2word = dict((i, word) for word, (i, _) in vocab.items())
Counter() is a container type that counts elements as they are added; here it produces a dictionary holding the number of times each word appears.
From that, a vocabulary dictionary that assigns an ID to each word (word → (id, frequency)) and an inverse dictionary with key and value swapped (id → word) are created.
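As a toy illustration (made-up words, not the actual Wikipedia data), the two dictionaries come out like this:

from collections import Counter

toy_corpus = ['the cat sat', 'the dog sat']
toy_vocab = Counter()
for line in toy_corpus:
    toy_vocab.update(line.split())

toy_vocab = {word: (i, freq) for i, (word, freq) in enumerate(toy_vocab.items())}
toy_id2word = {i: word for word, (i, _) in toy_vocab.items()}

print(toy_vocab)    # {'the': (0, 2), 'cat': (1, 1), 'sat': (2, 2), 'dog': (3, 1)}
print(toy_id2word)  # {0: 'the', 1: 'cat', 2: 'sat', 3: 'dog'}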
split_corpus.py
import math

# Split the corpus into 25 chunks; this example takes the second chunk,
# starting 1,000 lines before the chunk boundary.
split_size = math.floor(len(corpus) / 25)
start = split_size - 1000
end = split_size * 2
split_corpus = corpus[start:end]

print('\nLength of split_corpus: ', len(split_corpus))
# About 11,000 lines
The text used for training has about 250,000 lines, so it is divided into 25 chunks, a convenient round number that can be processed without crashing.
When each chunk is trained separately, the co-occurrence counts near a chunk boundary come out lower than they should be, so the start index is moved 1,000 lines before the boundary to keep the text around each break contiguous. A generalized version of this splitting is sketched below.
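The snippet above only computes the second chunk. Generalizing to all 25 chunks with the same 1,000-line overlap might look roughly like this (my sketch, not the original code):

import math

n_splits = 25
overlap = 1000
split_size = math.floor(len(corpus) / n_splits)

split_corpora = []
for k in range(n_splits):
    # Start 1,000 lines before the boundary (except for the first chunk)
    # so that co-occurrences spanning the boundary are not lost entirely.
    start = max(0, k * split_size - overlap)
    end = (k + 1) * split_size if k < n_splits - 1 else len(corpus)
    split_corpora.append(corpus[start:end])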
build_cooccur.py
import logging

import numpy as np
from scipy import sparse

logger = logging.getLogger(__name__)

window_size = 10
min_count = None
vocab_size = len(vocab)

# Sparse matrix in LIL format: only nonzero co-occurrence values are stored.
cooccurrences = sparse.lil_matrix((vocab_size, vocab_size), dtype=np.float64)

for i, line in enumerate(split_corpus):
    if i % 1000 == 0:
        logger.info('Building cooccurrence matrix: on line %i', i)
    tokens = line.strip().split()
    token_ids = [vocab[word][0] for word in tokens]

    for center_i, center_id in enumerate(token_ids):
        # Context words within window_size positions to the left of the center word.
        context_ids = token_ids[max(0, center_i - window_size):center_i]
        contexts_len = len(context_ids)

        for left_i, left_id in enumerate(context_ids):
            # Weight each co-occurrence by 1 / distance between the two words.
            distance = contexts_len - left_i
            increment = 1.0 / float(distance)
            # The matrix is symmetric, so update both directions.
            cooccurrences[center_id, left_id] += increment
            cooccurrences[left_id, center_id] += increment


def iter_cooccurrences():
    # Yield (center_id, context_id, cooccurrence) triples,
    # skipping words whose frequency is below min_count.
    for i, (row, data) in enumerate(zip(cooccurrences.rows, cooccurrences.data)):
        if i % 50000 == 0:
            logger.info('yield cooccurrence matrix: on line %i', i)
        if min_count is not None and vocab[id2word[i]][1] < min_count:
            continue
        for data_idx, j in enumerate(row):
            if min_count is not None and vocab[id2word[j]][1] < min_count:
                continue
            yield i, j, data[data_idx]


# In the reference implementation this generator is decorated with
# functools.wraps so that the yielded triples are returned as a list;
# here they are simply collected into a plain list for training.
cooccurrence_triples = list(iter_cooccurrences())
To save memory, the co-occurrence matrix is created with scipy's sparse.lil_matrix, which stores only the values of nonzero elements. Creating the matrix as a dense NumPy array already puts enough pressure on memory to cause a crash.
Even with sparse.lil_matrix(), the number of nonzero elements grows so large that memory pressure crashed the session at around 30,000 lines of text data.
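One way to watch this happen (an added check, not part of the original memo) is to log how many nonzero elements the sparse matrix holds right after the matrix-building loop, before the triples are extracted:

# Number of explicitly stored elements; each one costs noticeably more than
# 8 bytes in LIL format, so a rapidly growing count is an early warning of
# memory pressure.
print('nonzero elements:', cooccurrences.nnz)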
Part of GloVe's cost function: f(X_ij) * (theta_i^T e_j + b_i + b_j - log(X_ij))^2, summed over all co-occurring word pairs (see the paper for details).
Here i and j are indices into the vocabulary; the weight vectors and biases come in symmetric pairs (center word and context word), and theta_i and e_j are combined with a dot product (inner product).
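As my own note (not from the original memo): if the cost is tracked as 0.5 * f(X_ij) * cost_inner^2 with cost_inner = theta_i^T e_j + b_i + b_j - log(X_ij), then its partial derivatives are f(X_ij) * cost_inner * e_j with respect to theta_i, f(X_ij) * cost_inner * theta_i with respect to e_j, and f(X_ij) * cost_inner with respect to each bias. These correspond directly to grad_target, grad_context, and grad_bias_* in train_glove.py below.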
train_glove.py
from math import log
from random import shuffle

import numpy as np
import h5py

# Parameter initialization.
# Done only on the first run; afterwards the parameters are relayed
# by saving to and loading from HDF5.
# W = (np.random.rand(vocab_size * 2, vector_size) - 0.5) / float(vector_size + 1)
# biases = (np.random.rand(vocab_size * 2) - 0.5) / float(vector_size + 1)
# gradient_squared = np.ones((vocab_size * 2, vector_size), dtype=np.float64)
# gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

with h5py.File('glove_weight_relay.h5', 'r') as f:
    W = f['glove']['weights'][...]
    biases = f['glove']['biases'][...]
    gradient_squared = f['glove']['gradient_squared'][...]
    gradient_squared_biases = f['glove']['gradient_squared_biases'][...]

# The first half of W holds the center-word vectors, the second half the context vectors.
vocab_size = int(len(W) / 2)

# Bundle each co-occurrence triple with views of the corresponding parameters.
data = [(W[i_target],
         W[i_context + vocab_size],
         biases[i_target : i_target + 1],
         biases[i_context + vocab_size : i_context + vocab_size + 1],
         gradient_squared[i_target],
         gradient_squared[i_context + vocab_size],
         gradient_squared_biases[i_target : i_target + 1],
         gradient_squared_biases[i_context + vocab_size : i_context + vocab_size + 1],
         cooccurrence)
        for i_target, i_context, cooccurrence in cooccurrence_triples]

iterations = 7
learning_rate = 0.05
x_max = 100
alpha = 0.75

for i in range(iterations):
    shuffle(data)
    for (v_target, v_context, b_target, b_context,
         gradsq_W_target, gradsq_W_context,
         gradsq_b_target, gradsq_b_context, cooccurrence) in data:
        # The f(X_ij) weighting term.
        weight = (cooccurrence / x_max) ** alpha if cooccurrence < x_max else 1

        # Dot product of the word vectors, plus the biases, minus log of the count.
        cost_inner = (v_target.dot(v_context)
                      + b_target[0] + b_context[0]
                      - log(cooccurrence))
        cost = weight * (cost_inner ** 2)
        # Used to track the cost per iteration:
        # global_cost += 0.5 * cost

        # Partial derivatives.
        grad_target = weight * cost_inner * v_context
        grad_context = weight * cost_inner * v_target
        grad_bias_target = weight * cost_inner
        grad_bias_context = weight * cost_inner

        # Gradient updates = the learning step (AdaGrad style).
        v_target -= (learning_rate * grad_target / np.sqrt(gradsq_W_target))
        v_context -= (learning_rate * grad_context / np.sqrt(gradsq_W_context))
        b_target -= (learning_rate * grad_bias_target / np.sqrt(gradsq_b_target))
        b_context -= (learning_rate * grad_bias_context / np.sqrt(gradsq_b_context))

        # Accumulate squared gradients, used as the divisor at the next update.
        gradsq_W_target += np.square(grad_target)
        gradsq_W_context += np.square(grad_context)
        gradsq_b_target += grad_bias_target ** 2
        gradsq_b_context += grad_bias_context ** 2

# Save the parameters.
with h5py.File('glove_weight_relay.h5', 'a') as f:
    f['glove']['weights'][...] = W
    f['glove']['biases'][...] = biases
    f['glove']['gradient_squared'][...] = gradient_squared
    f['glove']['gradient_squared_biases'][...] = gradient_squared_biases
Parameters such as the weights and biases are shared across the entire training run, so they are initialized only on the first run; after that they are repeatedly saved and reloaded between chunks.
The number of training iterations is set to 7. More iterations give better parameters, but each pass takes a long time and this is only a trial build, so I kept the number low.
The gradient update uses the adaptive gradient method (AdaGrad) adopted in GloVe: each update is divided by the square root of the accumulated squared gradients, i.e. parameter -= learning_rate * grad / sqrt(sum of past grad^2). It is a close relative of stochastic gradient descent, with a per-parameter learning rate.
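For reference, the one-time initialization on the very first run has to create the HDF5 file with the parameter arrays. A minimal sketch, assuming the same group/dataset layout that the loading code expects (the embedding dimension vector_size is my placeholder; the memo does not state the value used):

import numpy as np
import h5py

vector_size = 100          # placeholder embedding dimension (not stated in the memo)
vocab_size = len(vocab)    # from build_vocab.py

W = (np.random.rand(vocab_size * 2, vector_size) - 0.5) / float(vector_size + 1)
biases = (np.random.rand(vocab_size * 2) - 0.5) / float(vector_size + 1)
gradient_squared = np.ones((vocab_size * 2, vector_size), dtype=np.float64)
gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

with h5py.File('glove_weight_relay.h5', 'w') as f:
    group = f.create_group('glove')
    group.create_dataset('weights', data=W)
    group.create_dataset('biases', data=biases)
    group.create_dataset('gradient_squared', data=gradient_squared)
    group.create_dataset('gradient_squared_biases', data=gradient_squared_biases)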
test_similarity.py
import numpy as np

# Merge each symmetric pair of vectors (center and context) by averaging.
merge_fun = lambda m, c: np.mean([m, c], axis=0)
normalize = True

vocab_size = int(len(W) / 2)
for i, row in enumerate(W[:vocab_size]):
    merged = merge_fun(row, W[i + vocab_size])
    if normalize:
        merged /= np.linalg.norm(merged)
    W[i, :] = merged
merge_W = W[:vocab_size]

# Find the 15 words whose vectors are most similar to the query word.
word = 'word'
n = 15
word_id = vocab[word][0]

dists = np.dot(merge_W, merge_W[word_id])
top_ids = np.argsort(dists)[::-1][:n + 1]
similar = [id2word[id] for id in top_ids if id != word_id][:n]
numpy.argsort() returns the indices that would sort the array in ascending order; indexing the result with [::-1] reverses it to get descending order.
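A tiny concrete example of that ascending/descending behaviour:

import numpy as np

dists = np.array([0.2, 0.9, 0.5])
print(np.argsort(dists))        # [0 2 1]  (ascending)
print(np.argsort(dists)[::-1])  # [1 2 0]  (descending)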
These are the 15 words whose vectors are closest to the query word: a list of words that appear to be used in similar ways, both semantically and syntactically.
When I looked up words with vectors similar to "America", the list consisted mostly of other country names.
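As a final check of the "analogy" property mentioned at the beginning, the merged matrix can also be queried with vector arithmetic. This is a rough sketch using the variables from test_similarity.py; the example words are placeholders and must exist in the vocabulary:

# Analogy query: a - b + c ~ ?  (e.g. 'king' - 'man' + 'woman')
def analogy(a, b, c, n=5):
    query = merge_W[vocab[a][0]] - merge_W[vocab[b][0]] + merge_W[vocab[c][0]]
    query /= np.linalg.norm(query)
    dists = np.dot(merge_W, query)
    top_ids = np.argsort(dists)[::-1]
    # Exclude the query words themselves from the results.
    exclude = {vocab[a][0], vocab[b][0], vocab[c][0]}
    return [id2word[i] for i in top_ids if i not in exclude][:n]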