[Python] Predicting which Shōten respondent an answer sounds like, and how many cushions it would earn, with Word2Vec + Random Forest

Motivation

I love Shōten (the long-running Japanese TV comedy show), and when I get home I binge-watch my recorded episodes endlessly. Temporarily freed from my natural language processing research, I take a break and laugh at...

♪ Chachara~ charara~ cha-cha!!! ♪ (the Shōten theme)

(inside my head)


Get an answer: (type in an answer that sounds like something from Shōten)
Yamada-kun! Take one cushion from Sanyūtei Enraku-san!

Pop! (struck by inspiration)

What is Shōten?

・An entertainment program that has been running for a very long time
・Famous for its ōgiri segment, in which professional rakugo storytellers give witty answers to prompts
・A funny answer earns you a cushion (zabuton); a flat or rude answer gets cushions taken away
・Collect ten cushions and you win an amazing prize

Purpose

Given a Shōten-style answer typed in by the user, the program predicts and displays:
・which respondent the answer most resembles
・how many cushions it would receive

Step ① Collecting sentences and preprocessing

From the past broadcast summaries for 2011 published by NTV ([http://www.ntv.co.jp/sho-ten/02_week/kako_2011.html](http://www.ntv.co.jp/sho-ten/02_week/kako_2011.html)), I recorded, for each answer:
・the answer text
・the respondent
・the change in cushions
Answers from anyone other than the six main respondents (announcer ōgiri, young performers' ōgiri, etc.) were excluded.

The number of answers collected was 1,773. I was surprised there was so little variation between respondents, at roughly 330 answers per person.

From these sentences I removed symbols and stray whitespace, stripped pictograms with emoji, and normalized full-width/half-width characters with mojimoji.
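The actual cleanup code appears later in shoten.py; as a standalone illustration, here is a minimal stdlib-only sketch of this kind of cleanup, using unicodedata.normalize('NFKC') as a stand-in for mojimoji (an assumption on my part, not the post's exact code):

```python
import re
import unicodedata

def clean(text: str) -> str:
    """Rough cleanup: normalize widths, then drop URLs, symbols, whitespace."""
    text = unicodedata.normalize('NFKC', text)   # full-width ASCII -> half-width
    text = re.sub(r'https?://\S+', '', text)     # URLs
    text = re.sub(r'[!-/:-@\[-`{-~]+', '', text) # ASCII symbols
    text = re.sub(r'\s+', '', text)              # whitespace (Japanese needs none)
    return text.lower()

print(clean('山田くん! 座布団を 1枚!'))  # -> 山田くん座布団を1枚
```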

Step ② Word2Vec

I converted each sentence into a 200-dimensional vector with Word2Vec, using the Japanese Wikipedia entity vectors to vectorize the words in each sentence. (I had doubts about using Wikipedia vectors for Shōten answers full of colloquial expressions, but the convenience of an off-the-shelf pretrained model won out.) The average of the word vectors extracted from an answer was used as the vector for that answer.
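The averaging step can be sketched as follows, with a toy 4-dimensional vocabulary standing in for the 200-dimensional Wikipedia entity vectors (the dictionary here is illustrative, not the real model):

```python
import numpy as np

# Toy stand-in for the Japanese Wikipedia entity vectors (really 200-dim).
w2v = {
    '座布団': np.array([1.0, 0.0, 0.0, 0.0]),
    '山田':   np.array([0.0, 1.0, 0.0, 0.0]),
}

def sentence_vector(words, model, num_features):
    """Average the vectors of the in-vocabulary words; OOV words are skipped."""
    vec = np.zeros(num_features, dtype='float32')
    count = 0
    for word in words:
        if word in model:          # skip out-of-vocabulary words
            vec += model[word]
            count += 1
    return vec / max(count, 1)     # avoid division by zero for all-OOV input

print(sentence_vector(['山田', '座布団', '未知語'], w2v, 4))  # averages the two known words
```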

Step ③ Training a random forest

I built the classifiers with random forests. Random forest is a great fit here because it is cheap to train. I also tuned the hyperparameters with GridSearchCV. The search ranges were: maximum depth 1 to 10, number of decision trees 1 to 1000.

After the parameter search, I take the estimator with the highest accuracy and save the classifier with pickle.
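The grid_param_mori() helper used in the snippet below is never defined in the post; given the stated search ranges, it presumably returns a parameter grid along these lines (the exact step sizes here are my guess):

```python
def grid_param_mori():
    """Hypothetical parameter grid matching the stated ranges:
    max depth 1-10, number of trees 1-1000 (the step sizes are assumptions)."""
    return {
        'max_depth': list(range(1, 11)),
        'n_estimators': [1, 10, 50, 100, 500, 1000],
    }

print(grid_param_mori()['max_depth'])  # -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```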

gridsearch.py


import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 10-fold cross-validated grid search over the random forest parameters
grid_mori_speaker = GridSearchCV(RandomForestClassifier(), grid_param_mori(),
                                 cv=10, scoring='accuracy', verbose=3, n_jobs=-1)
grid_mori_speaker.fit(kotae_vector, shoten.speaker)  # answer vectors vs. speaker labels
grid_mori_speaker_best = grid_mori_speaker.best_estimator_
with open('shoten_speaker_RF.pickle', mode='wb') as fp:
    pickle.dump(grid_mori_speaker_best, fp)

The same procedure is run for the cushion-count classifier as for the respondent classifier, and it is saved to its own pickle file.

Incidentally, the best accuracy found was 0.25 for respondent identification and 0.50 for the cushion count. That is still quite low, so I would like to improve steps ② and ③ to raise the accuracy.
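For context, some back-of-the-envelope baselines (my own arithmetic, not from the post) show how modest those numbers are:

```python
# Rough chance baselines for the two tasks (my own arithmetic).
n_respondents = 6
uniform_guess = 1 / n_respondents
print(round(uniform_guess, 3))  # -> 0.167 : chance accuracy for respondent ID

# If more than half of the answers earned 0 cushions (as noted later in the
# post), a constant "0 cushions" predictor already scores about 0.5,
# on par with the reported 0.50.
print(0.25 / uniform_guess)  # -> 1.5 : reported accuracy is 1.5x chance
```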

Step ④ Make a program that takes a typed answer and classifies it

Create a program that lets you type in a sentence by hand and see the classification results. All it does is unpickle the saved classifiers, feed in the sentence vector, and print the predictions.

shoten.py


#!/usr/bin/env python
# coding: utf-8

import numpy as np
import re
import emoji
import mojimoji
import MeCab
from gensim.models import KeyedVectors
import pickle

mecab = MeCab.Tagger("")  # if you use the NEologd dictionary, pass its path here
model_entity = KeyedVectors.load_word2vec_format("entity_vector.model.bin",binary = True)

with open('shoten_speaker_RF.pickle', mode='rb') as f:
    speaker_clf = pickle.load(f)
with open('shoten_zabuton_RF.pickle', mode='rb') as f:
    zabuton_clf = pickle.load(f)
    
def text_to_vector(text, w2vmodel, num_features):
    # Clean up the input string
    kotae = text
    kotae = kotae.replace(',', '、')
    kotae = kotae.replace('\n', '')
    kotae = kotae.replace('\t', '')
    kotae = re.sub(r'\s', '', kotae)
    kotae = re.sub(r'^@[\w]+', '', kotae)                              # @mentions
    kotae = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+', '', kotae)  # URLs
    kotae = re.sub(r'[!-/:-@\[-`{-~]+', '', kotae)                     # ASCII symbols
    kotae = re.sub(r'[:-@,\[\]★☆「」。、・]+', '', kotae)             # full-width symbols
    kotae = mojimoji.zen_to_han(kotae, kana=False)
    kotae = kotae.lower()
    kotae = ''.join('' if character in emoji.UNICODE_EMOJI else character
                    for character in kotae)
    # Tokenize with MeCab, dropping symbol (記号) and BOS/EOS nodes
    kotae_node = mecab.parseToNode(kotae)
    kotae_line = []
    while kotae_node:
        meta = kotae_node.feature.split(",")
        if meta[0] not in ('記号', 'BOS/EOS'):
            kotae_line.append(kotae_node.surface)
        kotae_node = kotae_node.next
    # Average the word vectors, skipping out-of-vocabulary words
    feature_vec = np.zeros(num_features, dtype="float32")
    word_count = 0
    for word in kotae_line:
        try:
            feature_vec = np.add(feature_vec, w2vmodel[word])
            word_count += 1
        except KeyError:
            pass
    if word_count > 0:
        feature_vec = np.divide(feature_vec, word_count)
    return feature_vec.tolist()

def zabuton_challenge(insert_text):
    vector = np.array(text_to_vector(insert_text, model_entity, 200)).reshape(1, -1)
    speaker = speaker_clf.predict(vector)[0]
    zabuton = zabuton_clf.predict(vector)[0]
    if zabuton == 0:
        print(str(speaker) + ", no cushion for you!")
    elif zabuton > 0:
        print("Yamada-kun! Give " + str(speaker) + " " + str(zabuton) + " cushion(s)!")
    elif zabuton < 0:
        print("Yamada-kun! Take " + str(zabuton * -1) + " cushion(s) from " + str(speaker) + "!")
    else:
        print("Yamada-kun! Take every cushion from the developer whose classifier threw this error!")
        
if __name__ == "__main__":
    while True:
        text = input("Please answer:")
        zabuton_challenge(text)

Please forgive the sparse comments for now. The body of text_to_vector() is adapted from code in a blog post (I'm sorry, I've lost track of the source).

Running it

Start shoten.py and you can type in answers. (Loading the pickle files takes a while at first, though...)

As test data, I typed in the answers to question 1 of broadcast #2395, aired December 29, 2012. Let's try it.

(Screenshot: コメント 2019-12-01 161726.png, showing the program's classification output)

Only Koyuza, Enraku, and Kikuo appear in this output, though depending on the answer the other three are predicted as well. The low hit rate simply reflects the poor accuracy of the classifier. I also suspect that nobody is awarded cushions because more than half of the collected answers earned 0 cushions.

Improvement plan

・Collect more data (the source site has answers from 2011 through April 2014; I want to gather more and brute-force the problem with sheer data volume ~~but it's very tedious~~)
・Use a corpus that handles colloquial expressions well (I couldn't find anything besides the Wikipedia corpus, so if you know of a colloquial-friendly one, please let me know)
・Change the classification algorithm (I'm thinking of trying BERT, since I've been meaning to use BERT in my research anyway)
