[Python] "Deep Learning from Scratch ❷": notes on where an amateur stumbled, Chapter 4

Introduction

These are my notes on the places where I stumbled in Chapter 4 while working through "Deep Learning from Scratch ❷ -- Natural Language Processing", a book I picked up on a whim.

The execution environment is macOS Catalina + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.

Chapter 4 Speeding up word2vec

This chapter speeds up the word2vec CBOW model built in Chapter 3.

4.1 Improvement of word2vec ①

First comes speeding up the path from the input layer to the hidden layer. This part acts as the embedding that converts words into distributed representations, but a MatMul layer is wasteful here, so it is replaced with an Embedding layer.
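To see concretely why the MatMul layer is wasteful, here is a minimal sketch I wrote myself (not the book's layer code): picking out rows of $W$ by word ID gives exactly the same result as multiplying one-hot vectors by $W$.

```python
import numpy as np

# A minimal sketch (my own, not the book's layer code): extracting rows of W by
# word ID gives the same result as multiplying one-hot vectors by W, without the
# wasted multiply-by-zero work of the MatMul layer.
vocab_size, hidden_size = 7, 3
W = np.arange(vocab_size * hidden_size).reshape(vocab_size, hidden_size).astype('f')

idx = np.array([0, 2, 0, 4])               # word IDs in a mini-batch (as in Figure 4-5)
x = np.eye(vocab_size)[idx]                # the same mini-batch as one-hot rows

out_embedding = W[idx]                     # Embedding layer: just a row lookup
out_matmul = np.dot(x, W)                  # MatMul layer: full matrix product

print(np.allclose(out_embedding, out_matmul))  # True
```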

The Embedding layer itself is simple, but the part of the backpropagation implementation that adds into $dW$ when idx contains duplicates may be a bit confusing. The book brings this up with Figure 4-5 and then leaves the explanation to the reader: "think about why we add."

So I thought it through by comparing it with the backpropagation calculation of the MatMul layer, since the Embedding layer must produce the same result as the MatMul layer.

First, convert $idx$ from Figure 4-5 back into the MatMul layer's $x$.

\begin{align}
idx &= 
\begin{pmatrix}
0\\
2\\
0\\
4\\
\end{pmatrix}\\
\\
x &=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0\\
\end{pmatrix}
\end{align}

The backpropagation formula for the MatMul layer is $\frac{\partial L}{\partial W} = x^T \frac{\partial L}{\partial y}$ (see page 33), so rewritten in the notation of Figure 4-5 it becomes $dW = x^T dh$. Plugging the $x$ above and the $dh$ from Figure 4-5 into this and computing $dW$ gives the following. I actually wanted to draw $dh$ with the same shading as in Figure 4-5, but since I cannot reproduce the shades of ● here, I use $●, ◆, a, b$ instead.

\begin{align}
dW &= x^Tdh\\
\\
\begin{pmatrix}
? & ? & ? \\
○ & ○ & ○ \\
●_1 & ●_2 & ●_3 \\
○ & ○ & ○ \\
◆_1 & ◆_2 & ◆_3 \\
○ & ○ & ○ \\
○ & ○ & ○ \\
\end{pmatrix}
&=
\begin{pmatrix}
1 & 0 & 1 & 0\\
0 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
\end{pmatrix}
\begin{pmatrix}
a_1 & a_2 & a_3 \\
●_1 & ●_2 & ●_3 \\
b_1 & b_2 & b_3 \\
◆_1 & ◆_2 & ◆_3 \\
\end{pmatrix}\\
\end{align}

As you can see from this calculation, the second row ($●_1 ●_2 ●_3$) and the fourth row ($◆_1 ◆_2 ◆_3$) of $dh$ become the third and fifth rows of $dW$ as they are, just as in Figure 4-5. And the $?$ in the first row of $dW$, the row in question, works out as follows:

\begin{align}
\begin{pmatrix}
a_1 + b_1 & a_2 + b_2 & a_3 + b_3 \\
○ & ○ & ○ \\
●_1 & ●_2 & ●_3 \\
○ & ○ & ○ \\
◆_1 & ◆_2 & ◆_3 \\
○ & ○ & ○ \\
○ & ○ & ○ \\
\end{pmatrix}
&=
\begin{pmatrix}
1 & 0 & 1 & 0\\
0 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 0 & 0\\
0 & 0 & 0 & 0\\
\end{pmatrix}
\begin{pmatrix}
a_1 & a_2 & a_3 \\
●_1 & ●_2 & ●_3 \\
b_1 & b_2 & b_3 \\
◆_1 & ◆_2 & ◆_3 \\
\end{pmatrix}
\end{align}

In other words, the first and third rows of $dh$ are added together. The Embedding layer must implement the same calculation as this MatMul layer, which is why the addition is necessary.
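To convince myself, here is a small check I wrote (again, my own sketch, not the book's code) that accumulating $dh$ into $dW$ with np.add.at matches the MatMul result $dW = x^T dh$ even when idx contains duplicates.

```python
import numpy as np

# A small check (my own sketch): the Embedding backward that accumulates dh into dW
# with np.add.at matches dW = x^T dh from the MatMul backward, even though
# idx contains the duplicated word ID 0.
vocab_size, hidden_size = 7, 3
idx = np.array([0, 2, 0, 4])
dh = np.random.randn(len(idx), hidden_size)

# MatMul backward: dW = x^T dh, with x the one-hot version of idx
x = np.eye(vocab_size)[idx]
dW_matmul = np.dot(x.T, dh)

# Embedding backward: scatter-add the rows of dh into dW at the positions in idx
# (plain indexing would overwrite the row for the duplicated ID instead of adding)
dW_embedding = np.zeros((vocab_size, hidden_size))
np.add.at(dW_embedding, idx, dh)

print(np.allclose(dW_matmul, dW_embedding))  # True
```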

4.2 Improvement of word2vec ②

Next is the improvement from the hidden layer to the output layer. The idea behind Negative Sampling, boldly cutting the learning down to a handful of negative examples, is interesting.
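As a reminder of how the sampling itself works, here is a minimal sketch of the idea behind the book's UnigramSampler as I understand it: negative word IDs are drawn with probability proportional to corpus frequency raised to the power 0.75 (the book's class also takes care to exclude the positive target and caches the distribution, which this sketch skips).

```python
import numpy as np

# Minimal sketch of the sampling idea (not the book's UnigramSampler class):
# draw negative word IDs with probability proportional to corpus frequency ** 0.75,
# so that rare words still get picked occasionally.
def sample_negative_ids(corpus, sample_size, power=0.75):
    counts = np.bincount(corpus)            # occurrence count of each word ID
    p = counts.astype(np.float64) ** power
    p /= p.sum()
    return np.random.choice(len(p), size=sample_size, replace=False, p=p)

corpus = np.array([0, 1, 2, 3, 4, 1, 2, 3, 2, 1, 0])
print(sample_negative_ids(corpus, sample_size=2))  # e.g. [2 1]
```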

I didn't stumble badly here, but the book omits the explanation of the Embedding Dot layer's backpropagation, saying it is "not a difficult problem, so think it through yourself", so I will summarize it briefly.

Cutting out just the Embedding Dot layer part of Figure 4-12 gives the following.

(Figure 1: the Embedding Dot layer cut out of Figure 4-12)

What the dot node does is multiply corresponding elements and then sum the results. So we can work out the backpropagation by decomposing it into multiplication nodes (see "1.3.4.1 Multiplication node" in Chapter 1) and a Sum node (see "1.3.4.4 Sum node" in Chapter 1). That gives the following form, with the backpropagation shown in blue.

(Figure 2: the dot node decomposed into multiplication nodes and a Sum node; backpropagation in blue)

Returning this to the previous Dot node diagram, it looks like this:

(Figure 3: the backpropagation folded back into the Dot node diagram)

Implementing it as in this figure is fine, but as it stands the shape of dout does not match h and target_W, so NumPy's * cannot compute the element-wise product. Therefore, first align the shapes with dout.reshape(dout.shape[0], 1) and then take the product. Implementing it this way, you can see that it turns into the code of `EmbeddingDot.backward()` in the book.
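Putting the figures into NumPy, here is my own sketch of the Embedding Dot forward and backward passes (assumed shapes: W is (V, H), h is (N, H), idx and dout are (N,)); in the book's `EmbeddingDot.backward()`, dtarget_W is then passed on to the Embedding layer's backward.

```python
import numpy as np

# My own sketch of the Embedding Dot layer, following the figures above
# (assumed shapes: W is (V, H), h is (N, H), idx and dout are (N,)).
def embedding_dot_forward(W, h, idx):
    target_W = W[idx]                        # Embedding part: pick out the target rows
    out = np.sum(target_W * h, axis=1)       # dot node = elementwise product + Sum node
    return out, (h, target_W)

def embedding_dot_backward(dout, cache):
    h, target_W = cache
    dout = dout.reshape(dout.shape[0], 1)    # (N,) -> (N, 1) so that * broadcasts over H
    dtarget_W = dout * h                     # goes on to the Embedding layer's backward
    dh = dout * target_W                     # flows back to the hidden layer
    return dh, dtarget_W

N, H, V = 4, 3, 7
W, h = np.random.randn(V, H), np.random.randn(N, H)
idx = np.array([0, 3, 1, 0])
out, cache = embedding_dot_forward(W, h, idx)
dh, dtarget_W = embedding_dot_backward(np.ones(N), cache)
print(out.shape, dh.shape, dtarget_W.shape)  # (4,) (4, 3) (4, 3)
```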

4.3 Learning improved word2vec

There was nothing in particular to stumble over in the training implementation. The book uses the PTB corpus, but I prefer Japanese after all, so as in Chapter 2 I trained on word-segmented text from Aozora Bunko.

To fetch the corpus, I use dataset/aozorabunko.py, a modified version of dataset/ptb.py. The source and how it works are described in the [Chapter 2 memo "Improvement of the count-based method"](https://qiita.com/segavvy/items/52feabbf7867020e117d#24-Improvement of Count-Based Method), so please refer to that.

ch04/train.py has also been changed to use the Aozora Bunko corpus, as follows. The changed lines are marked with ★ in the comments.

ch04/train.py


# coding: utf-8
import sys
sys.path.append('..')
from common import config
# To run on a GPU, uncomment the line below (requires cupy)
# ===============================================
# config.GPU = True
# ===============================================
from common.np import *
import pickle
from common.trainer import Trainer
from common.optimizer import Adam
from cbow import CBOW
from skip_gram import SkipGram
from common.util import create_contexts_target, to_cpu, to_gpu
from dataset import aozorabunko  # ★ Changed to use the Aozora Bunko corpus

# Hyperparameter settings
window_size = 5
hidden_size = 100
batch_size = 100
max_epoch = 10

# Read the data
corpus, word_to_id, id_to_word = aozorabunko.load_data('train')  # ★ Changed the corpus
vocab_size = len(word_to_id)

contexts, target = create_contexts_target(corpus, window_size)
if config.GPU:
    contexts, target = to_gpu(contexts), to_gpu(target)

# Build the model, etc.
model = CBOW(vocab_size, hidden_size, window_size, corpus)
# model = SkipGram(vocab_size, hidden_size, window_size, corpus)
optimizer = Adam()
trainer = Trainer(model, optimizer)

# Start training
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

# Save the data needed for later use
word_vecs = model.word_vecs
if config.GPU:
    word_vecs = to_cpu(word_vecs)
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)

Incidentally, training took about 8 hours in my environment.

(Figure: plot of the training loss, result.png)

Next, checking the results. I modified ch04/eval.py slightly so that various words can be tried from standard input. The changed parts are marked with ★.

ch04/eval.py


# coding: utf-8
import sys
sys.path.append('..')
from common.util import most_similar, analogy
import pickle


pkl_file = 'cbow_params.pkl'
# pkl_file = 'skipgram_params.pkl'

with open(pkl_file, 'rb') as f:
    params = pickle.load(f)
    word_vecs = params['word_vecs']
    word_to_id = params['word_to_id']
    id_to_word = params['id_to_word']

# most similar task  ★ Changed to read the query from standard input
while True:
    query = input('\n[similar] query? ')
    if not query:
        break
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)


# analogy task  ★ Changed to read the query from standard input
print('-'*50)
while True:
    query = input('\n[analogy] query? (3 words) ')
    if not query:
        break
    a, b, c = query.split()
    analogy(a, b, c,  word_to_id, id_to_word, word_vecs)

Below are the results of various trials.

First, a check of similar words. For comparison, I also list the count-based results I tried in Chapter 2. The CBOW window size is 5 in the book's code, but I also tried 2, the same as the count-based method.

| Similar words | Count-based from Chapter 2<br>(window size: 2) | CBOW<br>(window size: 5) | CBOW<br>(window size: 2) |
|---|---|---|---|
| you | Wife: 0.6728986501693726<br>wife: 0.6299399137496948<br>K: 0.6205178499221802<br>father: 0.5986840128898621<br>I: 0.5941839814186096 | you: 0.7080078125<br>Wife: 0.6748046875<br>wife: 0.64990234375<br>The handmaiden: 0.63330078125<br>I: 0.62646484375 | Wife: 0.7373046875<br>you: 0.7236328125<br>wife: 0.68505859375<br>The person: 0.677734375<br>teacher: 0.666015625 |
| Year | Anti: 0.8162745237350464<br>hundred: 0.8051895499229431<br>Minutes: 0.7906433939933777<br>Eight: 0.7857747077941895<br>Circle: 0.7682645320892334 | Circle: 0.78515625<br>Minutes: 0.7744140625<br>Year: 0.720703125<br>century: 0.70751953125<br>30: 0.70361328125 | Tsubo: 0.71923828125<br>Meter: 0.70947265625<br>Minutes: 0.7080078125<br>Minutesの: 0.7060546875<br>Seconds: 0.69091796875 |
| car | door: 0.6294019222259521<br>Door: 0.6016885638237<br>Automobile: 0.5859153270721436<br>gate: 0.5726617574691772<br>curtain: 0.5608214139938354 | Upper body: 0.74658203125<br>Warehouse: 0.744140625<br>Western-style building: 0.7353515625<br>Stairs: 0.7216796875<br>door: 0.71484375 | Stairs: 0.72216796875<br>Automobile: 0.7216796875<br>cave: 0.716796875<br>underground: 0.7138671875<br>door: 0.71142578125 |
| Toyota | Toyota is not found | Toyota is not found | Toyota is not found |
| Morning | night: 0.7267987132072449<br>Around: 0.660172164440155<br>Noon: 0.6085118055343628<br>evening: 0.6021789908409119<br>Next time: 0.6002975106239319 | evening: 0.65576171875<br>Kunimoto: 0.65576171875<br>the first: 0.65087890625<br>The Emperor's Birthday: 0.6494140625<br>Next: 0.64501953125 | evening: 0.68115234375<br>Noon: 0.66796875<br>Last night: 0.6640625<br>night: 0.64453125<br>Inside the gate: 0.61376953125 |
| school | Tokyo: 0.6504884958267212<br>Higher: 0.6290650367736816<br>Junior high school: 0.5801640748977661<br>University: 0.5742003917694092<br>Boarding house: 0.5358142852783203 | University: 0.81201171875<br>Boarding house: 0.732421875<br>Sumita: 0.7275390625<br>student: 0.68212890625<br>Junior high school: 0.6767578125 | Junior high school: 0.69677734375<br>University: 0.68701171875<br>recently: 0.6611328125<br>Tokyo: 0.65869140625<br>here: 0.65771484375 |
| Zashiki | Study: 0.6603355407714844<br>Sou side: 0.6362787485122681<br>Room: 0.6142982244491577<br>room: 0.6024710536003113<br>kitchen: 0.6014574766159058 | floor: 0.77685546875<br>desk: 0.76513671875<br>threshold: 0.76513671875<br>Main hall: 0.744140625<br>Entrance: 0.73681640625 | desk: 0.69970703125<br>floor: 0.68603515625<br>椽: 0.6796875<br>Study: 0.6748046875<br>Zoshigaya: 0.6708984375 |
| kimono | Beard: 0.5216895937919617<br>black: 0.5200990438461304<br>clothes: 0.5096032619476318<br>Western clothes: 0.48781922459602356<br>hat: 0.4869200587272644 | Avoid: 0.68896484375<br>cold sweat: 0.6875<br>Awaken: 0.67138671875<br>underwear: 0.6708984375<br>Which means: 0.662109375 | Costume: 0.68359375<br>Sightseeing: 0.68212890625<br>cotton: 0.6787109375<br>Play: 0.66259765625<br>Inkstone: 0.65966796875 |
| I | master: 0.6372452974319458<br>Extra: 0.5826579332351685<br>Kaneda: 0.4684762954711914<br>they: 0.4676626920700073<br>Labyrinth: 0.4615904688835144 | master: 0.7861328125<br>they: 0.7490234375<br>Extra: 0.71923828125<br>Cat: 0.71728515625<br>Inevitable: 0.69287109375 | master: 0.80517578125<br>they: 0.6982421875<br>Cat: 0.6962890625<br>wife: 0.6923828125<br>Lessing: 0.6611328125 |
| Criminal | Phantom: 0.6609077453613281<br>Thieves: 0.6374931931495667<br>Member: 0.6308270692825317<br>that person: 0.6046633720397949<br>Dive: 0.5931873917579651 | Next time: 0.7841796875<br>boss: 0.75439453125<br>that person: 0.74462890625<br>jewelry: 0.74169921875<br>eagle, I: 0.73779296875 | Fish fishing: 0.77392578125<br>that person: 0.74072265625<br>Coming soon: 0.7392578125<br>Light balloon: 0.7021484375<br>Intractable disease: 0.70166015625 |
| order | Talk: 0.6200630068778992<br>Consultation: 0.5290789604187012<br>Busy: 0.5178924202919006<br>Kindness: 0.5033778548240662<br>Lecture: 0.4894390106201172 | Reminder: 0.6279296875<br>Appraisal: 0.61279296875<br>graduate: 0.611328125<br>General meeting: 0.6103515625<br>luxury: 0.607421875 | Consultation: 0.65087890625<br>advice: 0.63330078125<br>Appraisal: 0.62451171875<br>Resignation: 0.61474609375<br>Proposal: 0.61474609375 |
| Gunless gun | Obsolete: 0.7266454696655273<br>Old-fashioned: 0.6771457195281982<br>saw: 0.6735808849334717<br>Nose breath: 0.6516652703285217<br>ignorance: 0.650424063205719 | Creed: 0.7353515625<br>Top sorting: 0.7294921875<br>Protagonist: 0.693359375<br>Born: 0.68603515625<br>For sale: 0.68603515625 | position: 0.724609375<br>At hand: 0.71630859375<br>Road next: 0.71142578125<br>Face: 0.70458984375<br>Subject: 0.69921875 |
| Cat | amen: 0.6659030318260193<br>Nobume: 0.5759447813034058<br>Ink: 0.5374482870101929<br>Status: 0.5352671146392822<br>usually: 0.5205280780792236 | Wisdom: 0.728515625<br>I: 0.71728515625<br>Picture: 0.70751953125<br>dyspepsia: 0.67431640625<br>Gluttony: 0.66796875 | I: 0.6962890625<br>Junior high school: 0.6513671875<br>love: 0.64306640625<br>they: 0.63818359375<br>Pig: 0.6357421875 |
| Liquor | book: 0.5834404230117798<br>tea: 0.469807893037796<br>Rest: 0.4605821967124939<br>Eat: 0.44864168763160706<br>rod: 0.4349029064178467 | Drink: 0.6728515625<br>quarrel: 0.6689453125<br>food: 0.66259765625<br>Yamakoshi: 0.646484375<br>Soba: 0.64599609375 | Violin: 0.63232421875<br>Monthly salary: 0.630859375<br>medicine: 0.59521484375<br>Grenade: 0.59521484375<br>Kira: 0.5947265625 |
| cuisine | Skein: 0.5380040407180786<br>Sign: 0.5214874744415283<br>original: 0.5175281763076782<br>Law: 0.5082278847694397<br>Shop: 0.5001937747001648 | Hall: 0.68896484375<br>History: 0.615234375<br>novel: 0.59912109375<br>Literature: 0.5947265625<br>take: 0.59033203125 | magazine: 0.666015625<br>Booth: 0.65625<br>Blacksmith: 0.61376953125<br>music: 0.6123046875<br>Kimono: 0.6083984375 |

The results are just as messy as they were in Chapter 2; it is hard to say which method is better. "Cat" appearing among the results for "I" shows the bias of the corpus. The likely reason is that the corpus is simply too small, since it only uses works by Soseki Natsume, Kenji Miyazawa, and Ranpo Edogawa.

Next is the analogy problem.

| Analogy problem | CBOW (window size: 5) | CBOW (window size: 2) |
|---|---|---|
| man:king = woman:? | Nu: 5.25390625<br>Absent: 4.2890625<br>Zu: 4.21875<br>Ruru: 3.98828125<br>shit: 3.845703125 | Big bird: 3.4375<br>Every moment: 3.052734375<br>back gate: 2.9140625<br>Kage: 2.912109375<br>Floor pillar: 2.873046875 |
| body:face = Automobile:? | Cop: 6.5<br>door: 5.83984375<br>Two people: 5.5625<br>Inspector: 5.53515625<br>Chief: 5.4765625 | door: 3.85546875<br>hole: 3.646484375<br>Lamp: 3.640625<br>Inspector: 3.638671875<br>shoulder: 3.6328125 |
| go:come = speak:? | To tell: 4.6640625<br>eleven: 4.546875<br>Thirteen: 4.51171875<br>listen: 4.25<br>ask: 4.16796875 | listen: 4.3359375<br>Regrettable: 4.14453125<br>Miya: 4.11328125<br>Say: 3.671875<br>eleven: 3.55078125 |
| food:Eat = book:? | Have: 4.3671875<br>Ask: 4.19140625<br>popularity: 4.1328125<br>Mountain road: 4.06640625<br>receive: 3.857421875 | Prompt: 3.51171875<br>Go: 3.357421875<br>Say: 3.2265625<br>listen: 3.2265625<br>Start to get slimy: 3.17578125 |
| summer:hot = winter:? | Accumulate: 5.23828125<br>Teru: 4.171875<br>come: 4.10546875<br>Everywhere: 4.05859375<br>Go: 3.978515625 | eleven: 4.29296875<br>Finished: 3.853515625<br>Thirteen: 3.771484375<br>Become: 3.66015625<br>Bad: 3.66015625 |

The very first problem already mixes in some distressing results, even if the scores are low. Insufficient training data can be scary. I feel I caught a glimpse of why "explainable AI" has been in such demand lately.

The other results are a mess too, but correct answers just barely show up for "body:face = Automobile:?" and "go:come = speak:?". "Cop" and "Inspector" appearing for "Automobile" is probably Ranpo Edogawa's influence.

I probably should have just used the Japanese Wikipedia, but I am contrarian by nature :sweat: Many people have tried Wikipedia, so if you are interested, try googling "wikipedia Japanese corpus".

4.4 Remaining themes for word2vec

Negative/positive classification of emails is explained as an example of transfer learning, but with the knowledge up to this chapter, even though words can be converted into fixed-length vectors, sentences such as emails cannot. So we cannot take on such a task yet.

Also, regarding the quality of distributed representations: for Japanese, the quality of the preceding word segmentation seems to have a large effect. Some Japanese distributed representation models have been released, but when considering transfer learning with them, I suspect that using the same word segmentation setup (logic, dictionary contents, parameters, etc.) is a prerequisite. If so, does that mean transfer learning is not so easy for tasks that deal with technical terms specific to an industry or an individual company? Japanese is a lot of work.

4.5 Summary

The first half of the book is finally done, though considering that Chapter 1 was a review of the first volume, I may still only be about a third of the way through. The road ahead looks long...

That's all for this chapter. If there are any mistakes, I would be grateful if you could point them out.
