I tried to verify whether the Natural Language API (sentiment analysis) supports net slang.

Introduction

I have been working on sentiment analysis of text data in natural language processing. In the process, I found net slang difficult to handle, so this time I decided to verify for myself whether the sentiment analysis service in the Google Cloud Natural Language API (hereinafter, the Natural Language API) supports net slang. (Caution) This is only my own verification, so it does not definitively determine whether the Natural Language API supports net slang.

Evaluation method

Depending on the sentence, "laughing" can be replaced with "grass" without changing the meaning, so I use this for the evaluation. For example, the following two statements mean the same thing.

・ 3 RTs and 4 likes in 3 minutes, laughing
・ 3 RTs and 4 likes in 3 minutes is grass

I prepare multiple such sentences, score both versions with the Natural Language API, and test whether there is a difference between the mean scores of the "laughing" and "grass" versions.

Procedure

The procedure is as follows.

  1. Collect text data with twitter API
  2. Preprocessing for text data
  3. Score with the Natural Language API
  4. Perform a test on the obtained score

Collect text data with twitter API

Since an application is required to use the twitter API, I applied by following [1]. The application was approved in one day.

Now that the application has been approved, let's get the text data. The code is based on [2]. Since preprocessing comes later, the acquired text data is written to a text file. The search keyword is "laugh".

import json
from requests_oauthlib import OAuth1Session

# OAuth authentication (fill in your own keys and tokens)
CK      = ""
CS      = ""
AT      = ""
ATS     = ""
twitter = OAuth1Session(CK, CS, AT, ATS)

url = 'https://api.twitter.com/1.1/search/tweets.json'

keyword = 'laugh'
params = {
         'count' : 100,      # number of tweets to get
         'q'     : keyword,  # search keyword
         }

f = open('./data/1/backup1.txt', 'w')

req = twitter.get(url, params=params)
print(req.status_code)
if req.status_code == 200:
    res = json.loads(req.text)
    for line in res['statuses']:
        print(line['text'])
        f.write(line['text'] + '\n')
        print('*******************************************')
else:
    print("Failed: %d" % req.status_code)
f.close()

The search results are as follows.

・ Sure, I'm out of the hall, but sumo laughs
・ Because it's a place to laugh! Laugh!!
・ What's that wwww laughing wwww

Preprocessing

Arrange the acquired text data. There are four tasks to be done here.

  1. Remove unnecessary strings such as "RT" and "@XXXX"
  2. Extract only the lines containing "laughing" from the text data
  3. Judge whether each "laughing" can be replaced with "grass"
  4. Create sentences with "laughing" changed to "grass" and combine them into a CSV file

Steps 1 and 2 are implemented as follows. Some tweets contained line breaks, which would have made step 3 very difficult, so those were removed as well.

import re

readF = open('./data/1/backup1.txt', 'r')
writeF = open('./data/1/preprocessing1.txt', 'w')
lines = readF.readlines()
for line in lines:
    if 'laugh' in line:
        # Remove "RT"
        line = re.sub('RT ', "", line)
        # Remove mentions such as "@XXXX " or "@XXXX"
        line = re.sub(r'(@\w*\W* )|(@\w*\W*)', "", line)
        writeF.write(line)
readF.close()
writeF.close()

Step 3 was the hardest. I thought that "laughing" could be replaced with "grass" with high probability in cases like the following:

・ "laughing" is at the end of the sentence
・ "laughing" is followed by a period (kuten)
・ "laughing" is followed by "w"

However, I was worried that this would bias the data, so in the end I judged each sentence manually. Text data that I determined could not be replaced was removed.
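For reference, a minimal sketch of the heuristic I considered (but did not end up using, since the judgment was ultimately done by hand) might look like the following; the function name and the regular expression are my own assumptions, not part of the actual pipeline.

import re

def maybe_replaceable(line):
    # Rough heuristic (not used in the end): True when "laugh" appears at
    # the end of the sentence, or is immediately followed by a period or "w".
    return re.search(r'laugh(?:[。.w]|$)', line.strip()) is not None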

The number of samples is now 200.

Step 4 was implemented as follows.

import pandas as pd

count = 6
lines = []
for i in range(count):
    print(i)
    readF = open('./data/' + str(i+1) + '/preprocessing' + str(i+1) + '.txt')
    lines += readF.readlines()
    readF.close()

df = pd.DataFrame([], columns=['warau', 'kusa'])
replaceLines = []
for line in lines:
    replaceLines.append(line.replace('laugh', 'grass'))
df["warau"] = lines
df["kusa"] = replaceLines
df.to_csv("./data/preprocessing.csv", index=False)

The result of the processing so far is as shown in the image below. img1.png

Google Cloud Natural Language API

The sentiment analysis service in the Google Cloud Natural Language API returns a sentiment score for the given text. The closer the score is to 1, the more positive the text; the closer it is to -1, the more negative [3]. Besides sentiment analysis, the Google Cloud Natural Language API also offers content classification.
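As a rough illustration of how such a score might be read (the cutoff values below are my own assumption for this article, not thresholds defined by the API):

def rough_label(score):
    # Assumed cutoffs for illustration only; the API itself just returns a
    # score between -1.0 and 1.0 (plus a magnitude, which is not used here).
    if score >= 0.25:
        return "positive"
    if score <= -0.25:
        return "negative"
    return "neutral or mixed"

print(rough_label(0.6))    # positive
print(rough_label(-0.05))  # neutral or mixed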

The program was implemented based on [4]. Pass the "laughing" and "grass" sentences to the Natural Language API and store the resulting scores in lists. Then add them to the pandas DataFrame as the columns "warauResult" and "kusaResult". Finally, output a CSV file.

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import os
import pandas as pd

credential_path = "/pass/xxx.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

client = language.LanguageServiceClient()

warauResultList = []
kusaResultList = []

df = pd.read_csv('./data/preprocessing.csv')
for index, text in df.iterrows():
    # Remove "\n"
    text["warau"] = text["warau"].replace('\n', '')
    text["kusa"] = text["kusa"].replace('\n', '')

    # Analyze "warau"
    document = types.Document(
        content=text["warau"],
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    warauResultList.append(sentiment.score)

    # Analyze "kusa"
    document = types.Document(
        content=text["kusa"],
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    kusaResultList.append(sentiment.score)

df["warauResult"] = warauResultList
df["kusaResult"] = kusaResultList

df.to_csv("./data/result.csv", index=False)

The result of the processing so far is as shown in the image below. img2.png

Histogram

The histogram of warauResult is as follows. warauResult.png

The histogram of kusaResult is as follows. kusaResult.png

We assume that each follows a normal distribution.

Test

Compare the values stored in warauResult with the values stored in kusaResult. This time, we test the difference in means for paired samples (a paired t-test). I referred to [5] and [6].
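Concretely, the paired t-test computes a t statistic from the per-sentence score differences d_i = warauResult_i - kusaResult_i:

\begin{aligned}
t = \frac{\bar{d}}{\sqrt{s_d^2 / n}}
\end{aligned}

where \bar{d} is the mean of the differences, s_d^2 their unbiased variance, and n the number of pairs. This is what stats.ttest_rel computes below.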

・ Null hypothesis: the score does not change when "laughing" is changed to "grass".
・ Alternative hypothesis: the score changes when "laughing" is changed to "grass".

The program looks like this:

from scipy import stats
import pandas as pd

# Test of the difference in means for paired samples
df = pd.read_csv('./data/result.csv')
print(stats.ttest_rel(df["warauResult"], df["kusaResult"]))

The results are as follows.

Ttest_relResult(statistic=3.0558408995373356, pvalue=0.0025520814940409413)

The reference for stats.ttest_rel is [7].

Quote: "If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages."

In other words, the p-value here is about 0.26%, which is smaller than the 5% significance level, so the null hypothesis is rejected. Therefore, changing "laughing" to "grass" changes the score. The sample contains only sentences in which "laughing" can (in my subjective judgment) be replaced with "grass". Nevertheless, since the score changed, I conclude that the Natural Language API does not support this net slang.

Is the number of samples sufficient?

Interval estimation of the mean is performed for each of warauResult and kusaResult. I referred to [8].

\begin{aligned}
\bar{X}-z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}} 
< \mu < 
\bar{X}+z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
\end{aligned}

The program looks like this:

from scipy import stats
import pandas as pd
import math

df = pd.read_csv('./data/result.csv')

print("Sample mean of warauResult", df['warauResult'].mean())
print("Sample mean of kusaResult", df['kusaResult'].mean())

# .var() returns the unbiased variance
print("Interval estimation of warauResult", stats.norm.interval(alpha=0.95,
                    loc=df['warauResult'].mean(),
                    scale=math.sqrt(df['warauResult'].var() / len(df))))
print("Interval estimation of kusaResult", stats.norm.interval(alpha=0.95,
                    loc=df['kusaResult'].mean(),
                    scale=math.sqrt(df['kusaResult'].var() / len(df))))

The results are as follows.

Sample mean of warauResult 0.0014999993890523911
Sample mean of kusaResult -0.061000001728534696
Interval estimation of warauResult (-0.0630797610044764, 0.06607975978258118)
Interval estimation of kusaResult (-0.11646731178466276, -0.005532691672406637)

Error range:
・ warauResult: approximately ±0.06458
・ kusaResult: approximately ±0.05546
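These error ranges are simply half the width of the 95% intervals above, i.e. 1.96 × sqrt(s² / n). A quick sketch of the computation, reusing df from the previous snippet:

import math

# Half-width of the 95% confidence interval for each column
err_warau = 1.96 * math.sqrt(df['warauResult'].var() / len(df))
err_kusa = 1.96 * math.sqrt(df['kusaResult'].var() / len(df))
print("warauResult error range: ±%.5f" % err_warau)
print("kusaResult error range: ±%.5f" % err_kusa)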

The sentiment scores returned by the Natural Language API range from -1 to 1. Within that range, I considered an error of about ±0.06 to be small.

By the way, as shown in [9], you can compute the required number of samples from the error range. For warauResult:

・ Confidence coefficient: 95%
・ Error range: ±0.06458

In this case, the required number of samples is 200.
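The underlying relation comes from solving the margin-of-error expression for n, with the square root of the unbiased variance standing in for the unknown population standard deviation:

\begin{aligned}
E = z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
\quad\Longrightarrow\quad
n = \left(\frac{z_{\frac{\alpha}{2}}\, s}{E}\right)^2
\end{aligned}

The code below computes n in exactly this way, with z = 1.96 and E = 0.06458.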

import numpy as np
#Since we do not know the standard deviation of the population, we substitute the square root of the unbiased variance.
rutoN = (1.96 *  np.sqrt(df['warauResult'].var()))/ 0.06458
N = rutoN * rutoN
print(N)

The results are as follows.

200.0058661538003

Improvement points

・ Whether a given "laughing" can be replaced with "grass" was judged by a single person, so it is not objective. → Have multiple people evaluate.

・ The current way of collecting data cannot gather a large number of samples. → If a large number of samples is needed, find a pattern and consider automating the collection.

・ How to decide the error range → I would like a principled reason for choosing the error range.

In conclusion

I would like to participate in the Advent Calendar next year as well.

References

[1] https://qiita.com/kngsym2018/items/2524d21455aac111cdee
[2] https://qiita.com/tomozo6/items/d7fac0f942f3c4c66daf
[3] https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values
[4] https://cloud.google.com/natural-language/docs/quickstart-client-libraries#client-libraries-install-python
[5] https://bellcurve.jp/statistics/course/9453.html
[6] https://ohke.hateblo.jp/entry/2018/05/19/230000
[7] https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html
[8] https://ohke.hateblo.jp/entry/2018/05/12/230000
[9] https://toukeigaku-jouhou.info/2018/01/23/how-to-calculate-samplesize/
