[Python] How about polarity analysis that takes "order" into account?

I learned a little about Japanese polarity analysis (sentiment analysis) and, on a bit of a whim, played around with it despite being an amateur.

In typical polarity analysis (sentiment analysis), each word is looked up in a polarity dictionary and the resulting scores are averaged over the whole sentence. But does that always produce an intuitive score...?

This is the 4th day of the Natural Language Processing Advent Calendar 2019. Yesterday brought kabayan55's ["A story for people who want to study natural language processing"](https://kabayan55.hatenablog.com/entry/2019/12/03/001554) and Kazumasa Yamamoto's ["Learning document categorization with spaCy CLI"](https://qiita.com/kyamamoto9120/items/84d62c3b33fb77c03fbe).

This is my first time participating in an Advent Calendar; I look forward to working with you.

First, what is polarity analysis?

Words can leave **positive impressions** or **negative impressions**. For example, "bright" leaves a positive impression on many people, while "dark" leaves a negative one.

**Polarity analysis** compiles these impressions (polarities) of words and evaluates the polarity of each word that appears in a sentence.

Polarity analysis often serves as a stepping stone for **sentiment analysis**. If a sentence is positive overall, it is reasonable to infer that the writer (or speaker) has positive feelings. ~~Though a Kyoto person might say one thing and mean another...~~

In this pipeline, **the validity of the dictionary**, that is, whether each word's polarity is evaluated correctly, is a very important consideration, but I won't touch on it here. Instead, we will focus on **how the polarity of the whole sentence is computed**.

"oseti", a polarity evaluation library for Python

There are various polarity dictionaries, but this time I will use the [Japanese Sentiment Polarity Dictionary published by the Inui-Suzuki laboratory](http://www.cl.ecei.tohoku.ac.jp/index.php?Open%20Resources%2FJapanese%20Sentiment%20Polarity%20Dictionary). In this dictionary, each word is scored 1 if positive and -1 if negative.
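As a toy illustration of the idea (this is just a sketch, not the dictionary's actual file format):

```python
# Hypothetical miniature polarity dictionary: +1 for positive, -1 for negative.
polarity_dict = {'bright': 1, 'dark': -1}

def word_score(word):
    # Words not in the dictionary contribute no polarity.
    return polarity_dict.get(word, 0)

print(word_score('bright'))  # 1
print(word_score('gloomy'))  # 0 (unknown word)
```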

"oseti" is a library that makes this polarity dictionary easy to use from Python; it was also announced on Qiita. It computes sentence scores via morphological analysis with MeCab.

Let's give it a try. I will skip the installation of MeCab. oseti is registered on PyPI, so you can install it with pip:

```
pip install oseti
```

You can calculate scores by creating an `Analyzer` instance and calling its methods.

```python
import oseti

analyzer = oseti.Analyzer()
```

To start, let's feed it the opening of that famous novel.

I am a cat. There is no name yet. I have no idea where I was born. I remember only crying in a dim and damp place. I saw human beings for the first time here. Moreover, I heard later that it was the most ferocious breed of human called a shosei (student).

―― "I Am a Cat", Natsume Soseki

```python
iamcat = ('I am a cat. There is no name yet.\n'
          'I have no idea where I was born. '
          'I remember only crying in a dim and damp place. '
          'I saw human beings for the first time here. '
          'Moreover, I heard later that it was the most ferocious breed '
          'of human called a shosei (student).')

iamcat_score = analyzer.analyze(iamcat)
print(iamcat_score)
```

```
[0, 0, 0, -1.0, 0, 1.0]
```

It is understandable that the fourth sentence (index 3),

I remember only crying in a dim and damp place.

is judged negative, but it is a little strange that the sixth sentence (index 5),

Moreover, I heard later that it was the most ferocious breed of human called a shosei (student).

is treated as positive. In a case like this, let's call the method that performs a detailed analysis.

```python
iamcat_detail = analyzer.analyze_detail(iamcat)
print(iamcat_detail[3])
print(iamcat_detail[5])
```

```
{'positive': [], 'negative': ['dim'], 'score': -1.0}
{'positive': ['Ichiban'], 'negative': [], 'score': 1.0}
```

I see: the word "ichiban" ("number one" / "most", as in "the most ferocious") is registered as positive, and that seems to be the reason.

Next, let's pass a sentence that mixes positive and negative words.

```python
test_text = 'I bought a new smartphone, and although it cost me a lot, the operation is light and comfortable.'
test_score = analyzer.analyze(test_text)
test_detail = analyzer.analyze_detail(test_text)
print(test_score)
print(test_detail)
```

```
[0.3333333333333333]
[{'positive': ['Light', 'comfortable'], 'negative': ['Spending'], 'score': 0.3333333333333333}]
```

The scores of the individual words are **averaged**. Averaging seems to be the **standard approach**; other articles on polarity analysis use it as well.
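To make the arithmetic concrete, here is a minimal sketch of the averaging (the word list mirrors the detailed output above; the +1/-1 values come from the dictionary):

```python
# Two positive words (+1) and one negative word (-1), averaged:
word_scores = [-1, 1, 1]  # 'Spending', 'Light', 'comfortable'
print(sum(word_scores) / len(word_scores))  # 0.3333333333333333
```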

When the average gets you into trouble

Now, consider two fictional characters, Mr. Sato and Mr. Suzuki. Their conversation goes like this.

Sato "I went to A city for a business trip the other day. ** I was tired because there were many people who got lost, but the food was delicious and the scenery was good **."

Suzuki "That was good. I went to B City, ** the scenery was good and the food was delicious, but I was tired and lost wherever I went **."

Mr. Sato seems to have **enjoyed** his sightseeing, but Mr. Suzuki apparently **did not**.

However, running polarity analysis on these two remarks gives this result.

```python
sato_remark = 'It was tiring because there were many people and I kept getting lost, but the food was delicious and the scenery was beautiful.'
suzuki_remark = 'The scenery was beautiful and the food was delicious, but it was tiring and I got lost wherever I went.'

sato_score = analyzer.analyze(sato_remark)
suzuki_score = analyzer.analyze(suzuki_remark)
print(f'Sato: {sato_score}')
print(f'Suzuki: {suzuki_score}')
```

```
Sato: [0.0]
Suzuki: [0.0]
```

Both come out to exactly **plus-minus zero**. Calling the detailed-analysis method shows why.

```python
sato_detail = analyzer.analyze_detail(sato_remark)
suzuki_detail = analyzer.analyze_detail(suzuki_remark)
print(f'Sato: {sato_detail}')
print(f'Suzuki: {suzuki_detail}')
```

```
Sato: [{'positive': ['delicious', 'view'], 'negative': ['Get lost', 'Tired'], 'score': 0.0}]
Suzuki: [{'positive': ['view', 'delicious'], 'negative': ['Tired', 'Get lost'], 'score': 0.0}]
```

Indeed, both remarks contain exactly the same scored words. As far as the dictionary is concerned, they are saying the same thing.

Still, Mr. Sato's remark feels positive to us and Mr. Suzuki's feels negative. Why?

One hypothesis is the order of the topics. Japanese speakers, at least, seem more likely to put what they really want to say **last**. It is natural to read Mr. Sato as feeling that getting lost and being tired matter little compared with the delicious food and beautiful scenery, and to read Mr. Suzuki the opposite way.

Weighting by order of appearance

So let's add a method that weights the polarity words according to the order in which they appear.

```python
import neologdn
import sengiri

def analyze_with_weight(self, text, weightfunc=None):
    # Default: uniform weights 1/n, which reproduces the plain average.
    if weightfunc is None:
        weightfunc = lambda n: [1 / n for _ in range(n)]
    text = neologdn.normalize(text)
    scores = []
    for sentence in sengiri.tokenize(text):
        # (word, polarity) pairs for this sentence, in order of appearance.
        polarities = self._calc_sentiment_polarity(sentence)
        if polarities:
            weights = weightfunc(len(polarities))
            # Weighted sum of the +-1 polarity values.
            scores.append(sum(weights[i] * p[1] for i, p in enumerate(polarities)))
        else:
            scores.append(0)
    return scores

# Attach the new method to oseti.Analyzer.
setattr(oseti.Analyzer, 'analyze_with_weight', analyze_with_weight)
```

`weightfunc` takes a function that, given an integer, returns a one-dimensional sequence with that many elements (a list, tuple, NumPy array, and so on); these are the weights applied in order of appearance. (I expect the weights to sum to 1, but this is not checked.) If omitted, the weights are uniform, which is exactly the same as taking the average.
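As a quick sanity check (a sketch using Mr. Sato's remark from above), the default uniform weights should reproduce the plain `analyze()` average:

```python
# Both should print [0.0]: uniform weights are just the average.
print(analyzer.analyze(sato_remark))
print(analyzer.analyze_with_weight(sato_remark))
```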

For example, the following function gives linearly increasing weights:

```python
from fractions import Fraction

def linear_weight(n):
    # Weights proportional to 1, 2, ..., n, normalized to sum to 1.
    l = list(range(1, n + 1))
    s = sum(l)
    return [Fraction(i, s) for i in l]
```
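For instance, four polarity words would get these weights (later words count more; `Fraction` reduces automatically):

```python
print(linear_weight(4))
# [Fraction(1, 10), Fraction(1, 5), Fraction(3, 10), Fraction(2, 5)]
```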

Analyzing the two remarks with this weighting looks like this.

```python
sato_score = analyzer.analyze_with_weight(sato_remark, linear_weight)
suzuki_score = analyzer.analyze_with_weight(suzuki_remark, linear_weight)
print(f'Sato: {sato_score}')
print(f'Suzuki: {suzuki_score}')
```

```
Sato: [Fraction(2, 5)]
Suzuki: [Fraction(-2, 5)]
```

...Oops, the scores are still rational numbers (`Fraction` objects). Let's convert them to floats.

```python
sato_score = [float(i) for i in sato_score]
suzuki_score = [float(i) for i in suzuki_score]
print(f'Sato: {sato_score}')
print(f'Suzuki: {suzuki_score}')
```

```
Sato: [0.4]
Suzuki: [-0.4]
```

With this, Mr. Sato's remark is judged relatively positive and Mr. Suzuki's relatively negative. Isn't this more intuitive than the first result?

But words are not that simple

However, this does not mean that this method will solve everything.

```python
score = analyzer.analyze_with_weight("If anything, it's a pleasant tiredness.", linear_weight)
score = [float(i) for i in score]
print(score)
```

```
[-0.3333333333333333]
```

In an example like this, a kind of oxymoron, the adjective that comes first may be what is really meant. Simply weighting later words more heavily, without regard for such subtleties, may be too crude. You have to read the "context".
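Just as a toy experiment, if we assume the earlier word is the one that counts and flip the weights (`reversed_linear_weight` is my own ad-hoc name), the same sentence should come out positive:

```python
def reversed_linear_weight(n):
    # Linearly *decreasing* weights: earlier polarity words count more.
    return list(reversed(linear_weight(n)))

score = analyzer.analyze_with_weight("If anything, it's a pleasant tiredness.", reversed_linear_weight)
print([float(i) for i in score])  # Expected: [0.3333...], mirroring the example above.
```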

After all, the more "correct" you want the evaluation to be, the more complicated the method becomes. But,

Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.

―― "The M3-Competition: results, conclusions and implications", Spyros Makridakis and Michele Hibon

In the same way, an overly complicated method does not always give better results. And even when it does, it makes no sense to pay a cost the improvement is not worth.

In that sense, the "average" is extremely simple and may well be a good way to get decent results.

In conclusion

So, that was a rambling piece by a tanuki (raccoon dog) who **doesn't know the first thing about natural language processing**.

Tomorrow's entry is by Mona Cat.

Next: Let's quickly try distributed word representations with fastText!
