Since the natural language processing Advent Calendar still had open slots, I would like to share a package I recently wrote and used: it removes stopwords from a list of documents whose words have already been morphologically analyzed into lists, and I have registered it on PyPI so it can be installed with pip.
There may already be a good library for this. scikit-learn, which I usually use, has a function for removing English stopwords, but nothing for Japanese. So I had always implemented it myself, and I figured it would be easier to register it on PyPI and install it with pip. Admittedly it is not much code and I could just rewrite it each time, but since it comes to more than a dozen lines, I thought some people might use it if I published it.
The code simply iterates over the list in plain Python and deletes anything that matches the stopword list, so there is nothing in it that speeds up execution. However, since I had a little free time today, I added a function that changes the contents of the stopword list per part of speech. I did this because I think which parts of speech are worth deleting depends on the characteristics of the text; for example, words of a given part of speech play slightly different roles in SNS posts than in news site articles. That said, my impression is that changing the contents of the stopword list does not significantly affect model accuracy.
The words to delete for each part of speech were selected by referring to Wikipedia, Mielka AI's "Consideration of Japanese Stopwords [By Part of Speech]", and SlothLib (roughly 30 to 50 words per part of speech).
You can install it with pip. The only library it depends on is scikit-learn.
```
pip install ja_stopword_remover
```
All you need to do is import it as usual.

```python
from ja_stopword_remover.remover import StopwordRemover
```
The key point is that **if you pass it a list of sentences, each morphologically analyzed into a list of words, it deletes the stopwords and returns the result**.
Create an instance of the `StopwordRemover` class and call its `remove()` method with the list of word lists as the argument; the list of results is returned.
```python
from ja_stopword_remover.remover import StopwordRemover
import pprint

# A poem by Mr. Tada (@ohta_nano)
text_list = [
    ["I", "Etc.", "Is", "Planetarium", "To", "Standing basket",
     "breaking Dawn", "of", "scene", "Only", "repeat"],
    ["Cherry Blossoms", "What", "「", "Cherry Blossoms", "」", "What", "Read",
     "What", "you", "From", "tell me", "get", "Man", "To", "want to become"],
]

stopwordRemover = StopwordRemover()
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
```
If you want to remove stopwords from just a single sentence, wrap that sentence in another list first, as in the sketch below.
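For example, a minimal sketch reusing the `stopwordRemover` instance from above (the sentence here is just hypothetical tokenized input):

```python
# A single tokenized sentence must still be wrapped in an outer list
single_sentence = ["I", "Etc.", "Is", "Planetarium", "To", "repeat"]
result = stopwordRemover.remove([single_sentence])
pprint.pprint(result[0])  # the cleaned sentence is the first (and only) element
```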
If you want to specify the parts of speech, use the arguments of `choose_parts()`.
| Parameter name | Part of speech |
|---|---|
| demonstrative | Demonstrative |
| pronoun | Pronoun |
| symbol | Symbol |
| verb | Verb |
| one_character | Single-character words |
| postpositional_particle | Particle |
| adjective | Adjective |
| auxiliary_verb | Auxiliary verb |
| slothlib | SlothLib words |
Usage looks like this. In the following case, only the words recorded in SlothLib are deleted.
```python
stopwordRemover.choose_parts(
    demonstrative=False,
    symbol=False,
    verb=False,
    one_character=False,
    postpositional_particle=False,
    slothlib=True,
    auxiliary_verb=False,
    adjective=False,
)
```
If you do not call `choose_parts()`, words of all parts of speech are erased by default. However, the arguments of `choose_parts()` themselves default to False, so if you want to delete only the words recorded in SlothLib, `choose_parts(slothlib=True)` is enough.
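A minimal sketch reusing the `stopwordRemover` and `text_list` from above:

```python
# Delete only the SlothLib words; every other flag keeps its False default
stopwordRemover.choose_parts(slothlib=True)
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
```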
If you want to use it in a scikit-learn pipeline, use the `SKStopwordRemover` class. This is also simple: just register the instance in the pipeline's steps.
```python
from sklearn.pipeline import Pipeline
# Import path assumed to match StopwordRemover's module
from ja_stopword_remover.remover import SKStopwordRemover

skStopwordRemover = SKStopwordRemover()
steps = [("StopwordRemover", skStopwordRemover)]
pipe = Pipeline(steps=steps)
pipe.fit(text_list)
text_list_result = pipe.transform(text_list)
```
When specifying parts of speech, pass the parts of speech you do not want to delete as keyword arguments when creating the instance, e.g. `SKStopwordRemover(one_character=False)`.
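Putting it together, a pipeline that keeps single-character words might look like this sketch (assuming `SKStopwordRemover` accepts the same part-of-speech flags as `choose_parts()`):

```python
from sklearn.pipeline import Pipeline

# Assumption: SKStopwordRemover takes the same flags as choose_parts();
# one_character=False means single-character words are NOT deleted
skStopwordRemover = SKStopwordRemover(one_character=False)
pipe = Pipeline(steps=[("StopwordRemover", skStopwordRemover)])
text_list_result = pipe.fit_transform(text_list)
```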
I find language and machine learning very interesting, but my learning isn't making much progress. Both at work and in private I want to do full-on AI text analysis and generation, but for now I am still just studying.