[Python] [Natural language processing] I tried "Deep Learning from Scratch ❷" in Japanese ①

Introduction

When doing natural language processing in Japanese with machine learning algorithms, isn't collecting a corpus (a large amount of text) the hardest part for beginners (especially the self-taught)?

Both "Deep Learning from scratch ❷ ~ Natural language processing ~" and other books, which are the subject of this article, basically deal with English corpora, and Japanese corpora with habits different from English. The current situation is that it is difficult to experience the processing of. (At least I had a lot of trouble because I couldn't collect any Japanese corpus.)

So this time, using the "Dokujo Tsushin" articles from livedoor news, I would like to implement, as far as it can be done in Japanese, the excellent book "Deep Learning from Scratch ❷ ~ Natural Language Processing ~", which anyone studying machine learning has probably picked up at least once.

Count-based natural language processing

This time, I replace the corpus with Japanese and implement the following range of "Deep Learning from Scratch ❷". Since preprocessing is more troublesome in Japanese than in English, please pay particular attention to that part.
Target
Book: "Deep Learning from Scratch ❷"
Scope of this article: Chapter 2 "Natural language and distributed representation of words", from 2.3 Count-based methods to 2.4.5 Evaluation with the PTB dataset

Environment

Mac OS (Mojave), Python 3 (Python 3.7.4), Jupyter Notebook

0. Advance preparation

The original data consists of a separate text file for each article delivery date, which is awkward to work with as is (probably more than 100 files), so first combine all the text files into a single new text file. Multiple text files can be combined with the following command (on Mac).

Terminal


$ cat ~/Directory name/*.txt >New text file name.txt


Reference
https://ultrabem-branch3.com/informatics/commands_mac/cat_mac
[Digression] This really is just an aside, but I personally had a hard time with the step above. At first I moved into the directory and ran the command "cat *.txt > new text file name.txt", and the process would not finish at all, probably because I had not specified the directory name (maybe the wildcard was trying to read every text file on my PC?), and in the end I got warning after warning saying "Not enough space!". Please be careful.
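Before running cat, it may also be worth checking how many files are about to be combined and their total size (this is just a sanity check, not part of the original procedure):

Terminal


$ ls ~/Directory name/*.txt | wc -l
$ du -ch ~/Directory name/*.txt | tail -n 1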

1. Data preprocessing

**⑴ Splitting the text into words**

python


import sys
sys.path.append('..')
import re
import pickle
from janome.tokenizer import Tokenizer
import numpy as np
import collections

with open("corpus/dokujo-tsushin/dokujo-tsushin-half.txt", mode="r",encoding="utf-8") as f: #Note 1)
    original_corpus = f.read()
    
text = re.sub("http://news.livedoor.com/article/detail/[0-9]{7}/","", original_corpus) #Note 2)
text = re.sub("[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+[0-9]{4}","", text) #Note 3)
text = re.sub("[\f\n\r\t\v]","", text)
text = re.sub(" ","", text)
text = re.sub("[「」]","", text)
text = [re.sub("[()]","", text)]

#<Point>
t = Tokenizer()

words_list = []
for word in text:
    words_list.append(list(t.tokenize(word, wakati=True))) #Materialize as a list (recent janome versions return a generator, which cannot be pickled)
    
with open("words_list.pickle",mode='wb') as f:
    pickle.dump(words_list, f)
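By the way, janome used above is a third-party library, so it is not bundled with Python. Assuming an ordinary pip environment, it can be installed from the terminal like this:

Terminal


$ pip install janome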

Note 1) Unlike "Deep Learning from Scratch ❷", we prepared the corpus ourselves this time, so we read our own corpus file here. Also, the code reads "dokujo-tsushin-half.txt": I originally tried to read "dokujo-tsushin-all.txt", but a warning was raised that it could not be read because it was too large, so I gave up on "all" and used "half" (in step ⓪, I concatenated only half of all the text files). Note 2) and Note 3) strip the article URLs and the delivery timestamps from each article using regular expressions.
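As an aside, instead of preparing a separate "half" file, it should also be possible to read only the first part of the large file. The following is just a minimal sketch of that idea, not what I actually did (the character limit is a hypothetical value):

python


max_chars = 5000000 #Hypothetical upper limit on the number of characters to read
with open("corpus/dokujo-tsushin/dokujo-tsushin-all.txt", mode="r", encoding="utf-8") as f:
    original_corpus = f.read(max_chars) #In text mode, read(n) returns at most n characters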

**Point** Since we are processing Japanese this time, sentences cannot be split into words by the method used in the book (splitting on whitespace). So I installed the third-party library janome and used it to split (tokenize) the Japanese sentences into words. Also, because the amount of text is large and the resulting number of words is huge, I decided to save the tokenization result with pickle so that later code can be tweaked and re-run without redoing the tokenization each time. The pickle file can be loaded as follows, and loading it is many times faster than tokenizing from scratch.


with open('words_list.pickle', mode='rb') as f:
    words_list = pickle.load(f)

print(words_list) #If you do not need to display the load result, this description is unnecessary

# =>output
#[['friend', 'representative', 'of', 'speech', '、', 'Germany', 'woman', 'Is', 'How', 'Doing', 'hand', 'Is', '?', 'soon', 'June', '・', 'Bride', 'When', 'Call', 'To be', 'June', '。', 'Germany', 'woman', 'of', 'During ~', 'To', 'Is', 'myself', 'of', 'formula', 'Is', 'yet', 'Nana', 'ofTo', 'Call', 'Re', 'hand', 'Just', '…', '…', 'WhenIU', 'celebration', 'Poverty', 'Status', 'of', 'Man', 'Also', 'Many', 'of', 'so', 'Is', 'NanaI', 'I wonder', 'U', 'Or', '?', 'SaらTo', 'Attendance', 'Number of times', 'To', 'Stack', 'hand', 'Go', 'When', '、', 'こHmmNana', 'Please', 'ごWhen', 'To', 'Sa', 'To be', 'こWhen', 'Also', '少Nanaく', 'NanaI', '。', 'Please', 'But', 'is there', 'Hmm', 'Is', 'but', '…', '…', 'friend', 'representative', 'of', 'speech', '、', 'Finally', 'hand', 'くRe', 'NanaI', 'Or', 'Nana', '?', 'Sahand', 'そHmmNana', 'Whenき', '、', 'Germany', 'woman', 'Is', 'How', 'Correspondence', 'Shi', 'Cod', 'Good', 'Or', '?', 'Recently', 'Is', 'When', 'the Internet', 'etc', 'so', 'Search', 'すRe', 'If', 'friend', 'representative', 'speech', 'for', 'of', 'Example sentence', 'site', 'But', 'TaくSaHmm', 'Out', 'hand', 'come', 'ofso', '、', 'そReら', 'To', 'reference', 'To', 'すRe', 'If', '、', 'Safe', 'Nana', 'Alsoof', 'Is', 'Who', 'soAlso', 'Create', 'soきる', '。', 'ShiOrShi', 'Yuri', 'SaHmm', '33', 'age', 'Is', 'Net', 'To', 'reference', 'To', 'Shi', 'hand', 'Create', 'Shi', 'Ta', 'Alsoofof', 'こRe', 'so', '本当To', 'Good', 'of', 'Or', 'anxiety', 'soShi', 'Ta', '。', '一Man暮らShi', 'Nana', 'ofso', '聞Or', 'Se', 'hand', 'Impressions', 'To', 'Ichi', 'hand', 'くTo be', 'Man', 'Also', 'I', 'NanaI', 'Shi', '、', 'Or', 'When', 'Ichi', 'hand', 'other', 'of', 'friend', 'To', 'Take the trouble', '聞Or', 'Seる', 'of', 'Also', 'How', 'Or', 'When',・ ・ ・ Omitted below

**⑵ Create a list with IDs for words**

def preprocess(text):
    word_to_id = {}
    id_to_word = {}
    
    #<Point>
    for words in words_list:
        for word in words:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word
                
    corpus = [word_to_id[w] for words in words_list for w in words] #Flatten the nested word list into a single sequence of IDs
    
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess(text)

print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['woman']:", word_to_id['woman'])
print("word_to_id['marriage']:", word_to_id['marriage'])
print("word_to_id['husband']:", word_to_id['husband'])

# =>output
# corpus size: 328831
# corpus[:30]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 5, 6, 2, 22, 23, 7, 24, 2]

# id_to_word[0]:friend
# id_to_word[1]:representative
# id_to_word[2]:of

# word_to_id['woman']: 6
# word_to_id['marriage']: 456
# word_to_id['husband']: 1453


Point
-The preprocess function is essentially the same as in the book, but since the text has already been split into words, that part has been removed, and unlike the book, the IDs are assigned inside a doubly nested for loop. The loop is nested because, unlike the book, the tokenization step produced a nested list (a list of word lists), so each word sits one level deeper. A minimal example of this ID assignment and flattening is shown below.
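The following is a small illustration with a hypothetical two-sentence words_list, showing how the nested loop assigns IDs and how the list comprehension flattens the nested list into corpus:

python


sample_words_list = [['I', 'like', 'cats'], ['I', 'like', 'dogs']] #Hypothetical tokenization result

sample_word_to_id = {}
sample_id_to_word = {}
for words in sample_words_list: #Outer loop: one word list per sentence
    for word in words: #Inner loop: each word in that sentence
        if word not in sample_word_to_id:
            new_id = len(sample_word_to_id)
            sample_word_to_id[word] = new_id
            sample_id_to_word[new_id] = word

sample_corpus = [sample_word_to_id[w] for words in sample_words_list for w in words] #Flatten into a single ID sequence

print(sample_corpus) # => [0, 1, 2, 0, 1, 3]
print(sample_id_to_word) # => {0: 'I', 1: 'like', 2: 'cats', 3: 'dogs'}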

2. Evaluation

I will omit detailed explanations of the code below because much of it overlaps with the book, but I have added some comments, so I hope you find it useful as a reference.
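As a quick reminder before the code, the ppmi function below uses the same definition as the book:

PPMI(x, y) = max(0, log2( C(x, y) * N / ( C(x) * C(y) ) ))

where C(x, y) is the co-occurrence count of words x and y, C(x) and C(y) are their individual counts taken from the co-occurrence matrix, and N is the sum of all counts in the matrix.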

#Creating a co-occurrence matrix
def create_co_matrix(corpus, vocab_size, window_size=1):
    corpus_size = len(corpus)
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    
    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size + 1):
            left_idx = idx - i
            right_idx = idx + i
            
            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1
                
            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1
            
    return co_matrix

#Judgment of similarity between vectors (cos similarity)
def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

#Ranking the similarity between vectors
def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(id_to_word)

    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return

#Improving word relevance indicators using positive mutual information (PPMI)
def ppmi(C, verbose=False, eps = 1e-8):
    M = np.zeros_like(C, dtype=np.float32)
    N = np.sum(C)
    S = np.sum(C, axis=0)
    total = C.shape[0] * C.shape[1]
    cnt = 0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi = np.log2(C[i, j] * N / (S[j]*S[i]) + eps)
            M[i, j] = max(0, pmi)

            if verbose:
                cnt += 1
                if cnt % (total//100) == 0:
                    print('%.1f%% done' % (100*cnt/total))
    return M

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(word_to_id)
print('counting  co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
try:
    #Dimensionality reduction with SVD using sklearn
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['Female', 'marriage', 'he', 'Mote']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

# =>Below, the output result
"""
[query]Female
male: 0.6902421712875366
Etc.: 0.6339510679244995
model: 0.5287646055221558
generation: 0.5057054758071899
layer: 0.47833186388015747

[query]marriage
love: 0.5706729888916016
Dating: 0.5485040545463562
Opponent: 0.5481910705566406
 ?。: 0.5300850868225098
Ten: 0.4711574614048004

[query]he
Girlfriend: 0.7679144740104675
boyfriend: 0.67448890209198
husband: 0.6713247895240784
parent: 0.6373711824417114
Former: 0.6159241199493408

[query]Mote
Ru: 0.6267833709716797
Consideration: 0.5327887535095215
Twink: 0.5280393362045288
Girls: 0.5190156698226929
bicycle: 0.5139431953430176
"""
