[PYTHON] Created a Chrome extension that uses the power of natural language processing to drive dark sites out of the world

Motivation

With the coronavirus pandemic, the news these days is full of dark stories... → If you only see the bright news, you should feel more positive!

What I made

A Chrome extension that fades Google search results: the darker (more negative) the content of a result, the harder it is to see. Chrome Web Store: Opty, GitHub

Actual screen

- Dark site: Screenshot 2020-04-22 11.01.17.png
- Bright site: Screenshot 2020-04-22 11.03.22.png

How to use

Just install the Chrome extension from here and search! Reference: How to install the extension

Environment

- Personnel: 2 university students (Tomohiro Inoue, Takeshi Watanabe)
- Production period: 1 day
- Cloud Functions
- Python 3.7
- JavaScript
- MeCab

System configuration

図.png

Brightness judgment

Break down a sentence into words and judge whether each element is bright or dark.

MeCab is used to split a sentence into words. For example, take the following script:

morph.py


import MeCab

tagger = MeCab.Tagger()
# "The new coronavirus is spreading widely around the world."
result = tagger.parse('新型コロナウイルスが世界的に大流行しています。')
print(result)

Running it prints one line per word, with the surface form followed by its features (part of speech, subcategories, conjugation type, conjugation form, base form, reading, pronunciation):

新型	名詞,一般,*,*,*,*,新型,シンガタ,シンガタ
コロナ	名詞,一般,*,*,*,*,コロナ,コロナ,コロナ
ウイルス	名詞,一般,*,*,*,*,ウイルス,ウイルス,ウイルス
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
世界	名詞,一般,*,*,*,*,世界,セカイ,セカイ
的	名詞,接尾,形容動詞語幹,*,*,*,的,テキ,テキ
に	助詞,副詞化,*,*,*,*,に,ニ,ニ
大	接頭詞,名詞接続,*,*,*,*,大,ダイ,ダイ
流行	名詞,サ変接続,*,*,*,*,流行,リュウコウ,リューコー
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
い	動詞,非自立,*,*,一段,連用形,いる,イ,イ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS

The sentence is broken down as shown above. From this result, the base form (the uninflected dictionary form, the seventh feature field) of each word is extracted and used.
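
As a rough illustration of that extraction step, here is a minimal sketch. It assumes MeCab's default IPAdic output format (surface form, a tab, then comma-separated features with the base form in the seventh field). The function name `extract_base_forms` and the fallback to the surface form when no base form is available are assumptions made for this example, not the extension's actual code. The sample text is hand-written in that format so the snippet runs without MeCab installed.

```python
def extract_base_forms(parsed):
    """Return the base form of each token in MeCab (IPAdic-style) output text."""
    base_forms = []
    for line in parsed.splitlines():
        if not line or line == "EOS":  # skip blank lines and the end marker
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # Index 6 holds the base form; "*" means MeCab has none for this token.
        # Falling back to the surface form is an assumption of this sketch.
        if len(fields) > 6 and fields[6] != "*":
            base_forms.append(fields[6])
        else:
            base_forms.append(surface)
    return base_forms


# Hand-written sample in MeCab's output format (part of the sentence above)
sample = (
    "流行\t名詞,サ変接続,*,*,*,*,流行,リュウコウ,リューコー\n"
    "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n"
    "て\t助詞,接続助詞,*,*,*,*,て,テ,テ\n"
    "い\t動詞,非自立,*,*,一段,連用形,いる,イ,イ\n"
    "EOS\n"
)

print(extract_base_forms(sample))  # ['流行', 'する', 'て', 'いる']
```

With real MeCab output, the string returned by `tagger.parse(...)` could be passed in place of `sample`.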

To determine the brightness of each element, I used the [Japanese Sentiment Polarity Dictionary](http://www.cl.ecei.tohoku.ac.jp/index.php?Open%20Resources%2FJapanese%20Sentiment%20Polarity%20Dictionary) published by the Inui-Suzuki Laboratory at Tohoku University.

In this dictionary, declinable words (verbs and adjectives) are classified as either positive (bright) or negative (dark), and nouns are classified into three levels: p (positive), n (negative), and e (neither).

wago.121808.pn


Negative (experience)
Negative (experience) give up
Negative (experience) Akiruno
Negative (experience)
Negative (experience)

pn.csv.m3.120408.trim


Thank you p ~ There is / enhances (existence / nature)
Thank you p ~ There is / enhances (existence / nature)
Thank you annoyance n ~ becomes / becomes (evaluation / emotion) subjective
Being e ~ as it is (evaluation / emotion) Subjectivity
Being e ~ as it is (evaluation / emotion) Subjectivity

Each component of the sentence is replaced with 1 if it is positive, -1 if it is negative, and 0 otherwise, and the average is taken as the brightness of the sentence.
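
The averaging step can be sketched as follows. The function name `sentence_brightness` and the sample values are hypothetical, and returning 0.0 for an empty sentence is an assumption of this sketch rather than something the article specifies.

```python
def sentence_brightness(element_values):
    """Average the per-element values (1 = positive, -1 = negative, 0 = other)."""
    if not element_values:  # assumption: an empty sentence counts as neutral
        return 0.0
    return sum(element_values) / len(element_values)


# Hypothetical sentence with two positive, one negative, and two neutral elements
values = [1, 0, 0, -1, 1]
print(sentence_brightness(values))  # 0.2
```

The result always falls between -1 (every element negative) and 1 (every element positive).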

Below are the points I stumbled upon during development.

Stumbling point 1: Polarity dictionary entries ≠ single words

I naively assumed that I could split the sentence into words and look up each word in the polarity dictionary, but it was not that simple. The polarity dictionary contains not only single words such as "good" but also elements made up of two or more words, such as "not good" ("good" + "not"). I therefore reorganized the dictionary so that it can be searched by multi-word elements: entries are grouped under the base form of their first word, and each entry maps its comma-joined base forms to a brightness value. The result is saved as a pickle file before use.

main.py


    # Store the noun entries in the lookup dictionary
    for line in pn_noun_file:
        line = line.rstrip('\n').split('\t')
        if line[1] == 'e':  # Ignore entries that are neither positive nor negative
            continue
        # List of base forms of the words that make up this dictionary entry
        basic_form = convert_to_basic_form(line[0])
        # Skip entries with no base form and entries whose only base form is one character
        if not basic_form:
            continue
        elif len(basic_form) == 1 and len(basic_form[0]) == 1:
            continue
        key = basic_form[0]
        if key not in pn_dict:
            pn_dict[key] = {}
        # Map the comma-joined base forms to the brightness (1 or -1)
        pn_dict[key][','.join(basic_form)] = 1 if line[1] == 'p' else -1
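
To make the resulting structure concrete, here is a small sketch of what such a nested dictionary might look like and how an element could be looked up after a pickle round trip. The entries and the helper name `lookup` are hypothetical, and the round trip uses `pickle.dumps` / `pickle.loads` in memory instead of a file so that the example is self-contained.

```python
import pickle

# Hypothetical entries: first base form -> {comma-joined base forms: brightness}
pn_dict = {
    "良い": {"良い": 1, "良い,ない": -1},  # "good" and "not good"
}

# Round trip through pickle, as would happen when saving and reloading the dictionary
restored = pickle.loads(pickle.dumps(pn_dict))


def lookup(pn_dict, base_forms):
    """Return the brightness registered for exactly these base forms, or None."""
    if not base_forms:
        return None
    return pn_dict.get(base_forms[0], {}).get(",".join(base_forms))


print(lookup(restored, ["良い"]))          # 1
print(lookup(restored, ["良い", "ない"]))  # -1
print(lookup(restored, ["悪い"]))          # None
```

Grouping entries under their first word means a scan over a sentence only has to check the candidates that start with the current word.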

Stumbling point 2: "Not fun" = positive?

While "bad" is registered in the polarity dictionary as a negative element, "not fun" is not registered, so only the "fun" part matched and the phrase was judged positive. Therefore, when ない ("not") immediately follows a matched element, the brightness value of the preceding part is inverted.

main.py


import pickle

# PN judgment: returns the average PN value of the elements in the text.
def calc_pn(basic_form):
    with open('pn.pkl', 'rb') as f:
        pn_dict = pickle.load(f)
    pn_values = []  # PN value of each element in the text

    while basic_form:
        pn_value = 0
        del_num = 1  # Number of words to remove from the front of the list
        beginning = basic_form[0]  # The first word is the dictionary key

        if beginning in pn_dict:
            for index, word in enumerate(basic_form):
                if word == "。" or word == "、":  # Stop at a sentence or clause break
                    break
                if index == 0:
                    joined_basic_forms = beginning
                else:
                    joined_basic_forms += ',' + word

                # "ない" (not) right after the matched element flips its sign
                if word == "ない" and del_num == index:
                    pn_value *= -1
                    del_num = index + 1

                if joined_basic_forms in pn_dict[beginning]:
                    pn_value = pn_dict[beginning][joined_basic_forms]
                    del_num = index + 1

        pn_values.append(pn_value)
        del basic_form[0:del_num]

    # An empty input has no elements, so treat it as neutral
    return sum(pn_values) / len(pn_values) if pn_values else 0
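
To try the scan without a pickle file, the following sketch re-implements the same idea with the dictionary passed in as an argument and returns the per-element values instead of their average. The function name `score_elements`, its parameters, and the two dictionary entries are hypothetical. It is meant as an illustration of the longest-match scan with negation flipping, not as the extension's actual code.

```python
def score_elements(base_forms, pn_dict, negation="ない", breaks=("。", "、")):
    """Return one PN value per consumed element (illustrative re-implementation)."""
    words = list(base_forms)  # copy so the caller's list is left untouched
    values = []
    pos = 0
    while pos < len(words):
        value = 0
        consumed = 1  # how many words this element uses up
        head = words[pos]
        if head in pn_dict:
            joined = ""
            for offset, word in enumerate(words[pos:]):
                if word in breaks:  # do not match across a sentence or clause break
                    break
                joined = word if offset == 0 else joined + "," + word
                # A negation directly after the matched element flips its sign
                if word == negation and consumed == offset:
                    value *= -1
                    consumed = offset + 1
                # A longer registered element replaces the shorter match
                if joined in pn_dict[head]:
                    value = pn_dict[head][joined]
                    consumed = offset + 1
        values.append(value)
        pos += consumed
    return values


# Hypothetical dictionary: "fun" is positive, "bad" is negative
sample_dict = {"楽しい": {"楽しい": 1}, "悪い": {"悪い": -1}}

print(score_elements(["楽しい", "ない"], sample_dict))            # [-1]
print(score_elements(["映画", "は", "楽しい", "。"], sample_dict))  # [0, 0, 1, 0]
```

In the first call, "楽しい" matches with value 1 and the following "ない" flips it to -1, so "not fun" is no longer judged positive.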

Improvement points

- Slow: currently, it takes about 3 seconds from the time the search results are displayed until the styles are applied.
- Decorating bright pages: I want results to stand out more the brighter they are.

Finally

I had some free time while staying home because of the coronavirus, so I built this as a learning project during spring break. Other projects are under development! If you like it, please give it an LGTM and follow us on Twitter!

- Extension: Opty
- Twitter: Tomohiro Inoue, Takeshi Watanabe

References

  1. I made a Gem to judge negative / positive
  2. [Python] Accelerate the loading of time series CSV
