[PYTHON] I published a module on PyPI that removes Japanese stop words, so I'm sharing it

Since the Advent calendar for natural language processing still had open slots, I registered a program I recently wrote on PyPI (pip). It removes stop words from a list of documents, where each document is a list of morphologically analyzed words. I would like to share it with you.

Introduction

There may already be a handy library for this, but scikit-learn, which I usually use, can remove English stop words and has nothing for Japanese.

So I always implemented it myself, and I thought it would be easier if I registered it on PyPI and could install it with pip.

It's not much code, so I could just write it each time, but it does come to a dozen or more lines, so I thought some people might use it if I made it installable with pip.

The code simply loops over the lists in plain Python and drops any word found in the stop word list, so it does nothing special to speed up execution.

However, since I had a little free time today, I added a feature for switching the contents of the stop word list by part of speech.

That's because I think it's better to change which parts of speech are removed depending on the characteristics of the text.

For example, I think words of a given part of speech play slightly different roles in SNS posts and in news site articles.

That said, I suspect that changing the contents of the stop word list does not significantly affect model accuracy.

The words to remove for each part of speech were selected with reference to Wikipedia, Mielka's "AI-Consideration of Japanese Stopwords [By Part of Speech]", and slothlib (roughly 30 to 50 words per part of speech).
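The basic approach described above (walk each tokenized sentence and drop every word found in a stop word list) can be sketched in a few lines. This is only an illustration and not the package's actual source: the function name `remove_stopwords`, the dictionary `STOPWORDS_BY_PART`, and the tiny word lists are made up for this example.

```python
# Illustrative sketch only; NOT the source of ja_stopword_remover.
# The word lists below are tiny placeholders, not the package's real lists.
STOPWORDS_BY_PART = {
    "postpositional_particle": {"は", "が", "を", "に", "の"},
    "symbol": {"「", "」", "、", "。"},
}


def remove_stopwords(text_list, stopwords_by_part=STOPWORDS_BY_PART):
    """Return a new list of sentences with every stop word dropped."""
    # Merge the per-part-of-speech sets into one set for fast lookups.
    stopwords = set().union(*stopwords_by_part.values())
    return [[word for word in words if word not in stopwords]
            for words in text_list]


sentences = [["桜", "の", "花", "が", "咲く"], ["「", "桜", "」", "を", "読む"]]
print(remove_stopwords(sentences))
```

Keeping one set per part of speech is what makes it easy to switch groups on and off later.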

How to use

You can install it with pip. The only library it depends on is scikit-learn.

pip install ja_stopword_remover

After that, just import it as usual.

from ja_stopword_remover.remover import StopwordRemover

Note that the input is a list of sentences, where each sentence is itself a list of morphologically analyzed words. The result comes back in the same shape with the stop words removed.

Create an instance of the StopwordRemover class and call its remove() method with that list as the argument, and the resulting list is returned.

Sample code


from ja_stopword_remover.remover import StopwordRemover
import pprint

# Poems by Mr. Tada (@ohta_nano)
text_list = [[ "I", "Etc.", "Is", "Planetarium", "To", "Standing basket", "breaking Dawn", "of", "scene", "Only", "repeat",],
    [ "Cherry Blossoms", "What", "「", "Cherry Blossoms", "」", "What", "Read", "What", "you", "From", "tell me", "get", "Man", "To", "want to become",],]

stopwordRemover = StopwordRemover()

text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)

If you want to remove stop words from just one sentence, wrap that sentence in another list.
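Wrapping and unwrapping a single sentence can be sketched as below. The helper `remove_from_sentence` and the stand-in `drop_particles` are hypothetical names made up for this example. In real use you would pass `stopwordRemover.remove` as the callable, assuming it returns one output list per input sentence as described above.

```python
def remove_from_sentence(remove, words):
    """Apply a list-of-sentences remover to one tokenized sentence."""
    # Wrap the sentence in an outer list, then unwrap the single result.
    return remove([words])[0]


def drop_particles(text_list):
    """Stand-in remover, used only to make this sketch runnable."""
    particles = {"は", "を"}
    return [[w for w in words if w not in particles] for words in text_list]


print(remove_from_sentence(drop_particles, ["私", "は", "本", "を", "読む"]))
```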

Specifying the parts of speech to remove

If you want to specify the parts of speech, pass them as arguments to choose_parts().

Parameter name            Part of speech
demonstrative             Demonstrative
pronoun                   Pronoun
symbol                    Symbol
verb                      Verb
one_character             Single character
postpositional_particle   Particle
adjective                 Adjective
auxiliary_verb            Auxiliary verb
slothlib                  slothlib words

For example, with the following call, only the words recorded in slothlib are removed.

    stopwordRemover.choose_parts(
        demonstrative=False,
        symbol=False,
        verb=False,
        one_character=False,
        postpositional_particle=False,
        slothlib=True,
        auxiliary_verb=False,
        adjective=False
    )

If you do not call choose_parts(), words of every part of speech are removed by default. However, each argument of choose_parts() defaults to False, so if you want to remove only the words recorded in slothlib, choose_parts(slothlib=True) is enough.
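The default handling described above can be pictured with a small stand-in class. This is a guess at the behaviour based only on the description, not the package's code: `PartChooser` and its attribute `enabled` are hypothetical, and only three of the parameters are shown to keep it short.

```python
class PartChooser:
    """Hypothetical stand-in that mimics the described flag behaviour."""

    PARTS = ("symbol", "verb", "slothlib")  # a subset, for brevity

    def __init__(self):
        # Until choose_parts() is called, every part of speech is enabled.
        self.enabled = set(self.PARTS)

    def choose_parts(self, symbol=False, verb=False, slothlib=False):
        # Every flag defaults to False, so only the parts passed as True
        # remain enabled after this call.
        flags = {"symbol": symbol, "verb": verb, "slothlib": slothlib}
        self.enabled = {name for name, on in flags.items() if on}


chooser = PartChooser()
print(sorted(chooser.enabled))       # all parts, since nothing was chosen yet
chooser.choose_parts(slothlib=True)
print(sorted(chooser.enabled))       # only slothlib remains
```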

Using it in a scikit-learn pipeline

If you want to use it in a scikit-learn pipeline, use the SKStopwordRemover class.

This is also simple to use: just register an instance as a step.

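    from sklearn.pipeline import Pipeline

    # Import SKStopwordRemover from the ja_stopword_remover package before running this.
    skStopwordRemover = SKStopwordRemover()

    step = [("StopwordRemover", skStopwordRemover)]

    pipe = Pipeline(steps=step)

    pipe.fit(text_list)

    text_list_result = pipe.transform(text_list)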

To specify parts of speech, pass False for the ones you do not want removed when creating the instance, for example SKStopwordRemover(one_character=False).
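For readers curious how a remover can sit in a pipeline at all: a scikit-learn pipeline step is expected to implement fit and transform methods. The class below is a dependency-free sketch of that interface, not the package's SKStopwordRemover. The name `SimpleStopwordStep` and its `stopwords` argument are invented for illustration, and a real transformer would normally also inherit from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin`.

```python
class SimpleStopwordStep:
    """Hypothetical transformer-style class exposing fit() and transform()."""

    def __init__(self, stopwords=()):
        self.stopwords = stopwords

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit() just returns self.
        return self

    def transform(self, X):
        # X is a list of sentences, each a list of words.
        stopwords = set(self.stopwords)
        return [[w for w in words if w not in stopwords] for words in X]


docs = [["桜", "の", "花"], ["私", "は", "読む"]]
step = SimpleStopwordStep(stopwords=["の", "は"])
print(step.fit(docs).transform(docs))
```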

In closing

I think language and machine learning are very interesting, but my studies aren't progressing much.

Both at work and in private, I want to be busily analyzing and generating text with AI, but so far all I do is study.
