Since the natural language processing Advent Calendar still had open slots, I would like to share a package I recently wrote and used: it removes stopwords from a list of documents whose words have already been morphologically analyzed into lists, and I have registered it on PyPI so it can be installed with pip.
There may already be a good library for this. scikit-learn, which I usually use, has a function for removing English stopwords, but nothing for Japanese. So I had always implemented it myself, and I figured it would be easier to register it on PyPI and install it with pip. Admittedly it is not much code and I could just rewrite it each time, but since it comes to more than a dozen lines, I thought some people might use it if I published it.
The code simply iterates over the list in plain Python and deletes anything that matches the stopword list, so there is nothing in it that speeds up execution. However, since I had a little free time today, I added a function that changes the contents of the stopword list per part of speech. I did this because I think which parts of speech are worth deleting depends on the characteristics of the text; for example, words of a given part of speech play slightly different roles in SNS posts than in news site articles. That said, my impression is that changing the contents of the stopword list does not significantly affect model accuracy.
The words to delete for each part of speech were selected by referring to Wikipedia, Mielka AI's "Consideration of Japanese Stopwords [By Part of Speech]", and SlothLib (roughly 30 to 50 words per part of speech).
You can install it with pip. The only library it depends on is scikit-learn.
```
pip install ja_stopword_remover
```
All you need to do is import it as usual.

```python
from ja_stopword_remover.remover import StopwordRemover
```
The key point is that **if you pass it a list of sentences, each morphologically analyzed into a list of words, it deletes the stopwords and returns the result**.
Create an instance of the `StopwordRemover` class and call its `remove()` method with the list of word lists as the argument; the list of results is returned.
```python
from ja_stopword_remover.remover import StopwordRemover
import pprint

# A poem by Mr. Tada (@ohta_nano)
text_list = [
    ["I", "Etc.", "Is", "Planetarium", "To", "Standing basket",
     "breaking Dawn", "of", "scene", "Only", "repeat"],
    ["Cherry Blossoms", "What", "「", "Cherry Blossoms", "」", "What", "Read",
     "What", "you", "From", "tell me", "get", "Man", "To", "want to become"],
]

stopwordRemover = StopwordRemover()
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
```
If you want to remove stopwords from just a single sentence, wrap that sentence in another list first, as in the sketch below.
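For example, a minimal sketch reusing the `stopwordRemover` instance from above (the sentence here is just hypothetical tokenized input):

```python
# A single tokenized sentence must still be wrapped in an outer list
single_sentence = ["I", "Etc.", "Is", "Planetarium", "To", "repeat"]
result = stopwordRemover.remove([single_sentence])
pprint.pprint(result[0])  # the cleaned sentence is the first (and only) element
```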
If you want to specify the parts of speech, use the arguments of `choose_parts()`.
| Parameter name | Part of speech |
|---|---|
| demonstrative | Demonstrative |
| pronoun | Pronoun |
| symbol | Symbol |
| verb | Verb |
| one_character | Single-character words |
| postpositional_particle | Particle |
| adjective | Adjective |
| auxiliary_verb | Auxiliary verb |
| slothlib | SlothLib words |
Usage looks like this. In the following case, only the words recorded in SlothLib are deleted.
```python
stopwordRemover.choose_parts(
    demonstrative=False,
    symbol=False,
    verb=False,
    one_character=False,
    postpositional_particle=False,
    slothlib=True,
    auxiliary_verb=False,
    adjective=False,
)
```
If you do not call `choose_parts()`, words of all parts of speech are erased by default. However, the arguments of `choose_parts()` themselves default to False, so if you want to delete only the words recorded in SlothLib, `choose_parts(slothlib=True)` is enough.
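A minimal sketch reusing the `stopwordRemover` and `text_list` from above:

```python
# Delete only the SlothLib words; every other flag keeps its False default
stopwordRemover.choose_parts(slothlib=True)
text_list_result = stopwordRemover.remove(text_list)
pprint.pprint(text_list_result)
```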
If you want to use it in a scikit-learn pipeline, use the `SKStopwordRemover` class. This is also simple: just register the instance in the pipeline's steps.
```python
from sklearn.pipeline import Pipeline
# Import path assumed to match StopwordRemover's module
from ja_stopword_remover.remover import SKStopwordRemover

skStopwordRemover = SKStopwordRemover()
steps = [("StopwordRemover", skStopwordRemover)]
pipe = Pipeline(steps=steps)
pipe.fit(text_list)
text_list_result = pipe.transform(text_list)
```
When specifying parts of speech, pass the parts of speech you do not want to delete as keyword arguments when creating the instance, e.g. `SKStopwordRemover(one_character=False)`.
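Putting it together, a pipeline that keeps single-character words might look like this sketch (assuming `SKStopwordRemover` accepts the same part-of-speech flags as `choose_parts()`):

```python
from sklearn.pipeline import Pipeline

# Assumption: SKStopwordRemover takes the same flags as choose_parts();
# one_character=False means single-character words are NOT deleted
skStopwordRemover = SKStopwordRemover(one_character=False)
pipe = Pipeline(steps=[("StopwordRemover", skStopwordRemover)])
text_list_result = pipe.fit_transform(text_list)
```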
I find language and machine learning very interesting, but my learning isn't making much progress. Both at work and in private I want to do full-on AI text analysis and generation, but for now I am still just studying.