Introduction

This is day 7 of the "Natural Language Processing Advent Calendar 2019". In my earlier article "Sentiment analysis to score the negative / positive degree of news articles", I ran into a problem: **there are few polarity dictionaries that can actually be used**. This article is my attempt to solve it. This time, we will try to automatically generate a polarity dictionary using fastText.

References

I referred to the following when training fastText.

-Summary of procedures for generating a learning model of natural language (Japanese) with fastText
-Trained model of fastText has been released

What is sentiment analysis

Sentiment analysis summary

Sentiment analysis is a method of analyzing the emotions expressed in text, using text mining and machine learning techniques. The most orthodox form is a one-axis analysis of positive versus negative, but some approaches analyze emotions in more detail.

Sentiment analysis method

The most common method of sentiment analysis is **focusing on the words contained in a sentence**, based on the idea that positive (negative) sentences should contain specific words. A list of such specific words is called a **polarity dictionary**, and the sentiment analysis in this article uses one.
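
As a rough illustration of this word-based approach, the sketch below counts how many tokens of a sentence appear in a positive or a negative word list. The word lists and the whitespace tokenizer are made-up placeholders for illustration only; Japanese text would need a morphological analyzer instead.

```python
# Toy word lists (hypothetical, for illustration only)
POSITIVE_WORDS = {"good", "happy", "excellent"}
NEGATIVE_WORDS = {"bad", "sad", "terrible"}

def classify_sentence(sentence):
    """Label a sentence by comparing its counts of positive and negative words."""
    tokens = sentence.lower().split()  # naive whitespace tokenization
    n_pos = sum(token in POSITIVE_WORDS for token in tokens)
    n_neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"

print(classify_sentence("The food was good and the staff were excellent"))
print(classify_sentence("A terrible and sad ending"))
```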

Polarity dictionary

About polarity dictionary

Polarity dictionaries are **dictionaries that attach a semantic flag such as "positive" or "negative" to each word or term**. Some simply carry a "positive" or "negative" flag, while others give the degree of positiveness or negativeness as a discrete score.
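
The two styles of dictionary can be pictured as follows. This is only a sketch with invented entries and scores: it shows a flag-style dictionary, a score-style dictionary, and one possible way (an assumption, not a standard) to score a tokenized sentence by averaging the scores of the words found in the dictionary.

```python
# Flag-style dictionary: each word only carries a label (entries are invented)
flag_dictionary = {"praise": "positive", "blame": "negative"}

# Score-style dictionary: each word carries a degree between -1 and 1 (values are invented)
score_dictionary = {"praise": 0.8, "help": 0.4, "blame": -0.6, "cruel": -0.9}

def sentence_score(tokens, dictionary):
    """Average the scores of the tokens that appear in the dictionary."""
    scores = [dictionary[t] for t in tokens if t in dictionary]
    return sum(scores) / len(scores) if scores else 0.0

print(sentence_score(["they", "praise", "and", "help"], score_dictionary))
```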

Challenges in sentiment analysis

There are few polarity dictionaries that can be used conveniently

I searched for polarity dictionaries because I wanted to do sentiment analysis, but I found only the following two that I could actually use. The Word Emotion Polarity Correspondence Table is very convenient because it assigns a discrete negative / positive value (in the range -1 to 1) to a large number of words, but it is not available for commercial use.

-Word Emotion Polarity Correspondence Table
-Japanese Evaluation Polarity Dictionary

Tuning is very troublesome

In practice, even when each word is assigned a negative / positive value, **whether a word should be judged positive or negative depends on the target of the analysis, the context, and the viewpoint from which the negative / positive judgment is made.** The polarity dictionary should therefore be tuned thoroughly for each task, but doing so takes a lot of time.

Solving the above problems

Automatic generation of polarity dictionary using fasttext

As a solution to the above problems, I would like to create a program that **automatically generates a polarity dictionary with little effort and is easy to tune**.

Automatic generation of polarity dictionary

Overview of auto-generated program

I created a program that automatically generates a polarity dictionary with the following mechanism.

Specify several **"very positive words"** and **"very negative words"** for the group of documents whose negative / positive polarity you want to judge.
For each morpheme in each sentence, use a fastText model fully trained on the Japanese Wikipedia pages to measure its average similarity to the **"very positive words"** and its average similarity to the **"very negative words"**.
Adopt the polarity with the higher average similarity. If it is "positive", the similarity value is used as is; if it is "negative", a minus sign is added. The result is the negative / positive score of the morpheme.
Rescale the scores of all words to the range -1 to 1 to complete the polarity dictionary.
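
The four steps above can be sketched without a trained model by using tiny hand-made vectors in place of fastText embeddings. Everything in this block (the 2-dimensional vectors, the seed words, and the function names) is a made-up stand-in for illustration, and cosine similarity is assumed as the similarity measure.

```python
import math

# Hypothetical 2-D "embeddings" standing in for real fastText vectors
vectors = {
    "great": (1.0, 0.1), "happy": (0.9, 0.2),
    "awful": (0.1, 1.0), "sad": (0.2, 0.9),
    "win": (0.8, 0.3), "lose": (0.3, 0.8),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean_similarity(word, seeds):
    """Step 2: average similarity between a word and a list of seed words."""
    sims = [cosine(vectors[word], vectors[s]) for s in seeds if s in vectors]
    return sum(sims) / len(sims) if sims else 0.0

def raw_score(word, posi_seeds, nega_seeds):
    """Step 3: keep the stronger polarity; the negative side gets a minus sign."""
    posi = mean_similarity(word, posi_seeds)
    nega = mean_similarity(word, nega_seeds)
    if posi > nega:
        return posi
    if nega > posi:
        return -nega
    return 0.0

def rescale(scores):
    """Step 4: min-max rescale to -1..1 (assumes at least two distinct scores)."""
    lo, hi = min(scores.values()), max(scores.values())
    return {w: 2 * (s - lo) / (hi - lo) - 1 for w, s in scores.items()}

# Step 1: choose seed words, then score the remaining words
posi_seeds, nega_seeds = ["great", "happy"], ["awful", "sad"]
raw = {w: raw_score(w, posi_seeds, nega_seeds) for w in ["win", "lose"]}
polarity_dictionary = rescale(raw)
print(polarity_dictionary)
```

Because the seed lists are plain arguments, tuning for a different task amounts to passing different seed words.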

Implementation of the polarity dictionary automatic generation program

About fastText

This time, fastText is used to measure the similarity between words. I also compared it with Word2vec, but fastText seemed to cope better with antonyms, so I chose fastText. I trained the fastText model from scratch, but training took 300 hours... Some people have published trained models (for example, "Trained model of fastText has been released"), so I think there is basically no problem in using one of those.

Also, someone else has just written about fastText for this Advent Calendar, so please take a look: "Let's use the distributed expression of words quickly with fastText!"

Automatic judgment of negative / positive degree

First, create a program that determines the degree of negative / positiveness of a specific word.

import gensim
# Load the fastText model
model = gensim.models.KeyedVectors.load_word2vec_format('./wikimodel_20191102.bin', binary=True)

# Arbitrarily specify "very positive words" and "very negative words"
posi_list = ['Excellent', 'good','Rejoice','praise', 'Congratulations','smart','good', 'Suitable','Tensei',
 'celebrate', 'Achievement','award','happy','joy','wit and intelligence','Virtue', 'talent','Great','aromatic','Honor',
 'Appropriate','worship','help',"I'm pulling out",'Shimizu','Majestic','Assortment','Fortunately','Kitcho','excel']

nega_list = ['bad', 'die', 'sick', 'terrible', 'Swear', 'Soak', 'Lowly',
 'poor', 'suffer', 'painful', 'Attach', 'Strict', 'difficult', 'kill', 'hard', 'Rough',
 'cruel', 'blame', 'enemy', 'Disobey', 'Mocking', 'Suffering', 'Spicy', 'Lonely', 'punishment', 'Unfaithful',
 'Cold', 'worthless', 'Sorry']

def mean_similarity(seed_words, x):
    # Average similarity between x and the seed words.
    # Words missing from the model vocabulary raise KeyError and are skipped.
    sims = []
    for i in seed_words:
        try:
            sims.append(model.similarity(i, x))
        except KeyError:
            continue
    # If no similarity could be computed, fall back to 0
    return sum(sims) / len(sims) if sims else 0

def posi_nega_score(x):
    # Judgment of positive degree
    posi_mean = mean_similarity(posi_list, x)
    # Judgment of negative degree
    nega_mean = mean_similarity(nega_list, x)
    # Adopt the stronger polarity; negative scores get a minus sign
    if posi_mean > nega_mean:
        return posi_mean
    if nega_mean > posi_mean:
        return -nega_mean
    return 0

If you use the above program to judge the negative / positive of a word, it will look like this.

print(posi_nega_score('excellence'))

0.2679512406197878

print(posi_nega_score('Disagreeable'))

-0.2425743742631032

Polarity dictionary automatic generation program

This time, we will use the "livedoor news corpus" as the dataset. For details of the dataset and the method of morphological analysis, please refer to my previously posted article. The result of the morphological analysis looks like this.

[Image: Screenshot 2019-12-07 8.51.39.png]

A polarity dictionary is generated based on this data.


import numpy as np
import pandas as pd

# Load the morphological analysis result
ddf = pd.read_csv('news_word.csv')

# Assign a score to each word
ddf['Score'] = ddf['word'].apply(posi_nega_score)

# Rescale the assigned scores to the range -1 to 1
score = np.array(ddf['Score'])
score_std = (score - score.min()) / (score.max() - score.min())
score_scaled = score_std * (1 - (-1)) + (-1)
ddf['Score'] = score_scaled
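
To make the rescaling step concrete, here is the same min-max formula applied to a small made-up array of scores. The input values are invented; the point is only that the smallest score is mapped to -1 and the largest to 1.

```python
import numpy as np

# Made-up raw scores, standing in for ddf['Score']
score = np.array([-0.30, -0.10, 0.00, 0.20, 0.50])

# Min-max scaling to the range 0 to 1, then stretching to -1 to 1
score_std = (score - score.min()) / (score.max() - score.min())
score_scaled = score_std * (1 - (-1)) + (-1)

print(score_scaled)
```

With this formula a raw score of 0 does not generally stay at 0 (here it becomes about -0.25), because the range is anchored to the minimum and maximum rather than to zero.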

Here are the **top 20 negative words** of the polarity dictionary created by the above program.

[Image: Screenshot 2019-12-07 8.58.11.png]

Here are the **top 20 positive words** of the polarity dictionary created by the above program.

[Image: Screenshot 2019-12-07 9.00.54.png]

You can see that plausible words appear in both lists. They include words that I did not specify as negative or positive seed words, so I think the accuracy is reasonably good.
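
Rankings like the ones above can be pulled out of the scored table with pandas. This is a sketch using a small made-up DataFrame with the same `word` and `Score` columns as `ddf`; the rows and scores are invented. `nsmallest` and `nlargest` return the rows with the lowest and highest scores.

```python
import pandas as pd

# Made-up stand-in for the scored DataFrame ddf
ddf = pd.DataFrame({
    "word": ["praise", "cruel", "help", "blame", "table"],
    "Score": [0.9, -1.0, 0.4, -0.7, 0.1],
})

n = 2  # use 20 for a top-20 list
top_negative = ddf.nsmallest(n, "Score")  # most negative words first
top_positive = ddf.nlargest(n, "Score")   # most positive words first

print(top_negative)
print(top_positive)
```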

Next

Thank you for reading to the end. This was my first time participating in an Advent Calendar, and I would like to keep participating actively. Tomorrow is oumugai mori!!

[PYTHON] Automatically generate a polarity dictionary used for sentiment analysis