I tried to verify whether the Natural Language API (sentiment analysis) supports net slang.

Introduction

I have been working on sentiment analysis of text data in natural language processing. In the process, I found net slang difficult to handle, so this time I decided to verify for myself whether the sentiment analysis service in the Google Cloud Natural Language API (hereinafter, the Natural Language API) supports net slang. (Caution) This is only my own verification, so it does not definitively determine whether the Natural Language API supports net slang.

Evaluation method

Depending on the sentence, "laughing" can be replaced with "grass" without changing the meaning, so I use this for the evaluation. For example, the following two statements mean the same thing.

・ 3 RTs and 4 likes in 3 minutes, laughing
・ 3 RTs and 4 likes in 3 minutes is grass

I prepare multiple such sentences, score both versions with the Natural Language API, and test whether there is a difference between the mean scores of the "laughing" and "grass" versions.

Procedure

The procedure is as follows.

  1. Collect text data with twitter API
  2. Preprocessing for text data
  3. Score with the Natural Language API
  4. Perform a test on the obtained score

Collect text data with twitter API

Since an application is required to use the twitter API, I applied by following [1]. The application was approved in one day.

Now that the application has been approved, let's get the text data. The code is based on [2]. Since preprocessing comes later, the acquired text data is written to a text file. The search keyword is "laugh".

import json
from requests_oauthlib import OAuth1Session

# OAuth authentication (fill in your own keys and tokens)
CK      = ""
CS      = ""
AT      = ""
ATS     = ""
twitter = OAuth1Session(CK, CS, AT, ATS)

url = 'https://api.twitter.com/1.1/search/tweets.json'

keyword = 'laugh'
params = {
         'count' : 100,      # number of tweets to get
         'q'     : keyword,  # search keyword
         }

f = open('./data/1/backup1.txt', 'w')

req = twitter.get(url, params=params)
print(req.status_code)
if req.status_code == 200:
    res = json.loads(req.text)
    for line in res['statuses']:
        print(line['text'])
        f.write(line['text'] + '\n')
        print('*******************************************')
else:
    print("Failed: %d" % req.status_code)
f.close()

The search results are as follows.

・ Sure, I'm out of the hall, but sumo laughs
・ Because it's a place to laugh! Laugh!!
・ What's that wwww laughing wwww

Preprocessing

Arrange the acquired text data. There are four tasks to be done here.

  1. Remove unnecessary strings such as "RT" and "@XXXX"
  2. Extract only the lines containing "laughing" from the text data
  3. Judge whether each "laughing" can be replaced with "grass"
  4. Create sentences with "laughing" changed to "grass" and combine them into a CSV file

Steps 1 and 2 are implemented as follows. Some tweets contained line breaks, which would have made step 3 very difficult, so those were removed as well.

import re

readF = open('./data/1/backup1.txt', 'r')
writeF = open('./data/1/preprocessing1.txt', 'w')
lines = readF.readlines()
for line in lines:
    if 'laugh' in line:
        # Remove "RT"
        line = re.sub('RT ', "", line)
        # Remove mentions such as "@XXXX " or "@XXXX"
        line = re.sub(r'(@\w*\W* )|(@\w*\W*)', "", line)
        writeF.write(line)
readF.close()
writeF.close()

Step 3 was the hardest. I thought that "laughing" could be replaced with "grass" with high probability in cases like the following:

・ "laughing" is at the end of the sentence
・ "laughing" is followed by a period (kuten)
・ "laughing" is followed by "w"

However, I was worried that this would bias the data, so in the end I judged each sentence manually. Text data that I determined could not be replaced was removed.
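For reference, a minimal sketch of the heuristic I considered (but did not end up using, since the judgment was ultimately done by hand) might look like the following; the function name and the regular expression are my own assumptions, not part of the actual pipeline.

import re

def maybe_replaceable(line):
    # Rough heuristic (not used in the end): True when "laugh" appears at
    # the end of the sentence, or is immediately followed by a period or "w".
    return re.search(r'laugh(?:[。.w]|$)', line.strip()) is not None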

The number of samples is now 200.

Step 4 was implemented as follows.

import pandas as pd

count = 6
lines = []
for i in range(count):
    print(i)
    readF = open('./data/' + str(i+1) + '/preprocessing' + str(i+1) + '.txt')
    lines += readF.readlines()
    readF.close()

df = pd.DataFrame([], columns=['warau', 'kusa'])
replaceLines = []
for line in lines:
    replaceLines.append(line.replace('laugh', 'grass'))
df["warau"] = lines
df["kusa"] = replaceLines
df.to_csv("./data/preprocessing.csv", index=False)

The result of the processing so far is as shown in the image below. img1.png

Google Cloud Natural Language API

The sentiment analysis service in the Google Cloud Natural Language API returns a sentiment score for the given text. The closer the score is to 1, the more positive the text; the closer it is to -1, the more negative [3]. Besides sentiment analysis, the Google Cloud Natural Language API also offers content classification.
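As a rough illustration of how such a score might be read (the cutoff values below are my own assumption for this article, not thresholds defined by the API):

def rough_label(score):
    # Assumed cutoffs for illustration only; the API itself just returns a
    # score between -1.0 and 1.0 (plus a magnitude, which is not used here).
    if score >= 0.25:
        return "positive"
    if score <= -0.25:
        return "negative"
    return "neutral or mixed"

print(rough_label(0.6))    # positive
print(rough_label(-0.05))  # neutral or mixed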

The program was implemented based on [4]. Pass the "laughing" and "grass" sentences to the Natural Language API and store the resulting scores in lists. Then add them to the pandas DataFrame as the columns "warauResult" and "kusaResult". Finally, output a CSV file.

from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import os
import pandas as pd

credential_path = "/pass/xxx.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

client = language.LanguageServiceClient()

warauResultList = []
kusaResultList = []

df = pd.read_csv('./data/preprocessing.csv')
for index, text in df.iterrows():
    # Remove "\n"
    text["warau"] = text["warau"].replace('\n', '')
    text["kusa"] = text["kusa"].replace('\n', '')

    # Analyze "warau"
    document = types.Document(
        content=text["warau"],
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    warauResultList.append(sentiment.score)

    # Analyze "kusa"
    document = types.Document(
        content=text["kusa"],
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    kusaResultList.append(sentiment.score)

df["warauResult"] = warauResultList
df["kusaResult"] = kusaResultList

df.to_csv("./data/result.csv", index=False)

The result of the processing so far is as shown in the image below. img2.png

Histogram

The histogram of warauResult is as follows. warauResult.png

The histogram of kusaResult is as follows. kusaResult.png

We assume that each follows a normal distribution.

Test

Compare the values stored in warauResult with the values stored in kusaResult. This time, we test the difference in means for paired samples (a paired t-test). I referred to [5] and [6].
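Concretely, the paired t-test computes a t statistic from the per-sentence score differences d_i = warauResult_i - kusaResult_i:

\begin{aligned}
t = \frac{\bar{d}}{\sqrt{s_d^2 / n}}
\end{aligned}

where \bar{d} is the mean of the differences, s_d^2 their unbiased variance, and n the number of pairs. This is what stats.ttest_rel computes below.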

・ Null hypothesis: the score does not change when "laughing" is changed to "grass".
・ Alternative hypothesis: the score changes when "laughing" is changed to "grass".

The program looks like this:

from scipy import stats
import pandas as pd

# Test of the difference in means for paired samples
df = pd.read_csv('./data/result.csv')
print(stats.ttest_rel(df["warauResult"], df["kusaResult"]))

The results are as follows.

Ttest_relResult(statistic=3.0558408995373356, pvalue=0.0025520814940409413)

The reference for stats.ttest_rel is [7].

Quote: "If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages."

In other words, the p-value here is about 0.26%, which is smaller than the 5% significance level, so the null hypothesis is rejected. Therefore, changing "laughing" to "grass" changes the score. The sample contains only sentences in which "laughing" can (in my subjective judgment) be replaced with "grass". Nevertheless, since the score changed, I conclude that the Natural Language API does not support this net slang.

Is the number of samples sufficient?

Interval estimation of the mean is performed for each of warauResult and kusaResult. I referred to [8].

\begin{aligned}
\bar{X}-z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}} 
< \mu < 
\bar{X}+z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
\end{aligned}

The program looks like this:

from scipy import stats
import pandas as pd
import math

df = pd.read_csv('./data/result.csv')

print("Sample mean of warauResult", df['warauResult'].mean())
print("Sample mean of kusaResult", df['kusaResult'].mean())

# .var() returns the unbiased variance
print("Interval estimation of warauResult", stats.norm.interval(alpha=0.95,
                    loc=df['warauResult'].mean(),
                    scale=math.sqrt(df['warauResult'].var() / len(df))))
print("Interval estimation of kusaResult", stats.norm.interval(alpha=0.95,
                    loc=df['kusaResult'].mean(),
                    scale=math.sqrt(df['kusaResult'].var() / len(df))))

The results are as follows.

Sample mean of warauResult 0.0014999993890523911
Sample mean of kusaResult -0.061000001728534696
Interval estimation of warauResult (-0.0630797610044764, 0.06607975978258118)
Interval estimation of kusaResult (-0.11646731178466276, -0.005532691672406637)

Error range:
・ warauResult: approximately ±0.06458
・ kusaResult: approximately ±0.05546
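These error ranges are simply half the width of the 95% intervals above, i.e. 1.96 × sqrt(s² / n). A quick sketch of the computation, reusing df from the previous snippet:

import math

# Half-width of the 95% confidence interval for each column
err_warau = 1.96 * math.sqrt(df['warauResult'].var() / len(df))
err_kusa = 1.96 * math.sqrt(df['kusaResult'].var() / len(df))
print("warauResult error range: ±%.5f" % err_warau)
print("kusaResult error range: ±%.5f" % err_kusa)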

The sentiment scores returned by the Natural Language API range from -1 to 1. Within that range, I considered an error of about ±0.06 to be small.

By the way, as shown in [9], you can compute the required number of samples from the error range. For warauResult:

・ Confidence coefficient: 95%
・ Error range: ±0.06458

In this case, the required number of samples is 200.
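The underlying relation comes from solving the margin-of-error expression for n, with the square root of the unbiased variance standing in for the unknown population standard deviation:

\begin{aligned}
E = z_{\frac{\alpha}{2}}\sqrt{\frac{s^2}{n}}
\quad\Longrightarrow\quad
n = \left(\frac{z_{\frac{\alpha}{2}}\, s}{E}\right)^2
\end{aligned}

The code below computes n in exactly this way, with z = 1.96 and E = 0.06458.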

import numpy as np
#Since we do not know the standard deviation of the population, we substitute the square root of the unbiased variance.
rutoN = (1.96 *  np.sqrt(df['warauResult'].var()))/ 0.06458
N = rutoN * rutoN
print(N)

The results are as follows.

200.0058661538003

Improvement points

・ Whether a given "laughing" can be replaced with "grass" was judged by a single person, so it is not objective. → Have multiple people evaluate.

・ The current way of collecting data cannot gather a large number of samples. → If a large number of samples is needed, find a pattern and consider automating the collection.

・ How to decide the error range → I would like a principled reason for choosing the error range.

In conclusion

I would like to participate in the Advent Calendar next year as well.

References

[1] https://qiita.com/kngsym2018/items/2524d21455aac111cdee
[2] https://qiita.com/tomozo6/items/d7fac0f942f3c4c66daf
[3] https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values
[4] https://cloud.google.com/natural-language/docs/quickstart-client-libraries#client-libraries-install-python
[5] https://bellcurve.jp/statistics/course/9453.html
[6] https://ohke.hateblo.jp/entry/2018/05/19/230000
[7] https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html
[8] https://ohke.hateblo.jp/entry/2018/05/12/230000
[9] https://toukeigaku-jouhou.info/2018/01/23/how-to-calculate-samplesize/
