import pandas as pd
pd.set_option('display.unicode.east_asian_width', True)
#Reading the emotion value dictionary
pndic = pd.read_csv(r"http://www.lr.pi.titech.ac.jp/~takamura/pubs/pn_ja.dic",
                    encoding="shift-jis",
                    names=['word_type_score'])
print(pndic)
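At this point each row of the frame is a single colon-delimited string of four fields. As a quick illustration (the concrete values depend on the downloaded file; "優れる:すぐれる:動詞:1" is a typical first entry):
#Each row packs four fields into one string:
#word (base form):reading:part of speech:emotion value in [-1, +1]
print(pndic.head(3))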
pandas' set_option() specifies various options such as the display format; the argument 'display.unicode.east_asian_width' aligns column names and values with double-byte characters taken into account. Next, split() divides the column into four fields using ":" as the delimiter, the word (base form) is extracted as "word" and the emotion value as "score", and the two are converted to dict type.
import numpy as np
#Extract word and emotion values
pndic["split"] = pndic["word_type_score"].str.split(":")
pndic["word"] = pndic["split"].str.get(0)
pndic["score"] = pndic["split"].str.get(3)
#Convert to dict type
keys = pndic['word'].tolist()
values = pndic['score'].tolist()
dic = dict(zip(keys, values))
print(dic)
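As a quick sanity check, a lookup returns the polarity score as a string (the word '優れる' is assumed here to be registered; scores remain strings at this point and are cast to float later):
#Scores are still strings at this point
print(dic.get('優れる'))
#Unregistered words return None
print(dic.get('未登録の語'))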
text = 'The nationwide import volume of spaghetti reached a record high by October, and customs suspects that the background is the so-called "stay-at-home demand" that has grown with the spread of the new coronavirus. According to Yokohama Customs, the amount of spaghetti imported through ports and airports nationwide was approximately 142,000 tons as of the end of October. This was a record high, exceeding the full-year import volume of three years ago by about 4,000 tons. Macaroni imports also exceeded 11,000 tons by October, roughly matching the full-year import volume of four years ago, which was the highest ever.'
lines = text.split("。")
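Here "。" is the Japanese full stop of the original article text; splitting on it leaves a trailing empty string when the text ends with a full stop. The empty word lists are filtered out further below, but a variant that drops empty sentences up front would look like this (an optional addition, not in the original):
#Optional: drop the empty string left by the trailing "。"
lines = [s for s in text.split("。") if s]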
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
" -Ochasen "
.import MeCab
mecab = MeCab.Tagger("-Ochasen")
#Illustrate the results of morphological analysis on the first line
print(mecab.parse(lines[0]))
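Each line of the -Ochasen output is tab-separated: surface form, reading, base form, part of speech, then conjugation details. The extraction loop below relies on the base form sitting at index 2 and the part of speech at index 3. A minimal sketch of pulling those fields from the first parsed token (assuming the first sentence is non-empty):
#Fields per token line (0-indexed): 0 surface, 1 reading, 2 base form, 3 part of speech
fields = mecab.parse(lines[0]).splitlines()[0].split()
print("base form:", fields[2], "| part of speech:", fields[3])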
#Extract words based on morphological analysis
word_list = []
for l in lines:
    temp = []
    for v in mecab.parse(l).splitlines():
        #Skip short lines such as "EOS"; a token line has at least 4 fields
        if len(v.split()) >= 4:
            #Field 3 is the part of speech; its first two characters identify
            #nouns (名詞), adjectives (形容詞), verbs (動詞) and adverbs (副詞)
            if v.split()[3][:2] in ['名詞', '形容', '動詞', '副詞']:
                #Field 2 is the base form of the word
                temp.append(v.split()[2])
    word_list.append(temp)
#Remove empty elements
word_list = [x for x in word_list if x != []]
print(word_list)
result = []
#Sentence-based processing
for sentence in word_list:
    temp = []
    #Word-based processing
    for word in sentence:
        #dic.get() returns None for words missing from the dictionary
        score = dic.get(word)
        temp.append((word, score))
    result.append(temp)
#Display as a data frame for each sentence
for i in range(len(result)):
    print(lines[i], '\n', pd.DataFrame(result[i], columns=["word", "score"]), '\n')
The results for all four sentences are summarized in the table below: from left to right, the words in each sentence and their emotion polarity values.
First, None marks words that are not registered in the word-emotion-polarity correspondence table; this is a problem that arises even with a dictionary with a vocabulary as large as this one.
More troubling is that only one word, "最高" ("highest"), has a positive polarity value, while all the others are negative, and for many of them it is hard to see why they are judged negative.
#Calculate the average value for each sentence
mean_list = []
for i in result:
    temp = []
    for j in i:
        if j[1] is not None:
            temp.append(float(j[1]))
    #Assumes every sentence contains at least one scored word
    mean = sum(temp) / len(temp)
    mean_list.append(mean)
#Display as a data frame
print(pd.DataFrame(mean_list, columns=["mean"], index=lines[0:4]))
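If a sentence contained no dictionary words, len(temp) would be zero and the division above would raise ZeroDivisionError. A defensive variant (an addition, not in the original) substitutes a neutral 0.0 in that case:
#Defensive variant: treat sentences with no scored words as neutral (0.0)
mean_list = []
for sentence_scores in result:
    scores = [float(s) for _, s in sentence_scores if s is not None]
    mean_list.append(sum(scores) / len(scores) if scores else 0.0)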
#Number of positive words
keys_pos = [k for k, v in dic.items() if float(v) > 0]
cnt_pos = len(keys_pos)
#Number of negative words
keys_neg = [k for k, v in dic.items() if float(v) < 0]
cnt_neg = len(keys_neg)
#Neutral word count
keys_neu = [k for k, v in dic.items() if float(v) == 0]
cnt_neu = len(keys_neu)
print("Percentage of positives:", ('{:.3f}'.format(cnt_pos / len(dic))), "(", cnt_pos, "word)")
print("Percentage of negatives:", ('{:.3f}'.format(cnt_neg / len(dic))), "(", cnt_neg, "word)")
print("Neutral percentage:", ('{:.3f}'.format(cnt_neu / len(dic))), "(", cnt_neu, "word)")
print("Number of elements before conversion to dict type:", len(pndic))
print("Number of elements after conversion to dict type:", len(dic), "\n")
pndic_list = pndic["word"].tolist()
print("Unique number of elements before conversion to dict type:", len(set(pndic_list)))
Each line of the dictionary follows the format "word (base form):reading:part of speech:emotion value [-1, +1]", but the same word (base form) can appear in multiple entries, for example with different readings or parts of speech, so duplicate data are scattered through the file. Counting the number of unique elements with such duplicates removed matches the 52,671 elements after conversion to dict type, as shown above.