- Introduction: I made a package for Wikification
- What is Wikification?
- What are the advantages of Wikification?
In a nutshell, it's the task of associating words in the text with Wikipedia articles.
As an example, let's say you have a sentence like this:
`As of 2016, there are five manufacturers participating, Yamaha, Honda, Suzuki, Ducati, and Aprilia, as well as satellite teams that can receive loans of works machines.`
This is a passage from the [Road Racing World Championship](https://ja.wikipedia.org/wiki/%E3%83%AD%E3%83%BC%E3%83%89%E3%83%AC%E3%83%BC%E3%82%B9%E4%B8%96%E7%95%8C%E9%81%B8%E6%89%8B%E6%A8%A9) article.
The task here is to link the `Suzuki` that appears in the text to the article [Suzuki (company)](https://ja.wikipedia.org/wiki/%E3%82%B9%E3%82%BA%E3%82%AD_%28%E4%BC%81%E6%A5%AD%29). This process is called Wikification.
There is a good article that summarizes the advantages far better than this post does, so please have a look at that as well.
It is no exaggeration to say that Wikipedia article titles are, by and large, keywords.
In the example text above, terms such as 2016, Yamaha, Honda, Suzuki, Ducati, Aprilia, manufacturer, works machine, and loan can all be associated with Wikipedia articles.
You may be wondering whether `2016` or `loan` really count as keywords, but as long as a Wikipedia article exists for a term, it is treated as a keyword here.
If you pick up the words found on Wikipedia, garbage words included, and then filter them sensibly, the result is quite useful for keyword extraction.
Wikification also involves word sense disambiguation (WSD), because in the real world it is common for a word to have multiple meanings.
For example, suppose you have the text `Today I went to TSUTAYA and rented the new Tsubomi DVD.` [^1]
In this case a simple text match leaves `Tsubomi` ambiguous: it could be [Tsubomi (AV actress)](https://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%BC%E3%81%BF_%28AV%E5%A5%B3%E5%84%AA%29), a flower bud, or [Tsubomi (Kobukuro song)](https://ja.wikipedia.org/wiki/%E8%95%BE_%28%E3%82%B3%E3%83%96%E3%82%AF%E3%83%AD%E3%81%AE%E6%9B%B2%29), among other possibilities.
Of course, we humans can infer from the context that "they went to TSUTAYA and rented a DVD, so it is probably `Tsubomi (AV actress)`!" Humans really are remarkably good at this. Selecting the correct sense of a word from its context in this way is called WSD, and it is a research field of natural language processing in its own right.
Wikification performs this WSD as part of the work of associating words with Wikipedia articles.
One of the advantages of Wikipedia is that it is __structured data__. Specifically, the category system and the article templates play that role.
You can build network graphs from the category structure and the article templates, which means they can serve as useful information for document classification and clustering tasks.
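As a rough sketch of that idea (this is not part of the package, and the article-category pairs below are made up for illustration), such a graph can be built with networkx:

```python
# Toy example: article-category pairs become edges of a graph that can
# feed features for document classification or clustering.
# (The pairs are invented for illustration.)
import networkx as nx

article_category_pairs = [
    ('ヤマハ発動機', 'オートバイメーカー'),
    ('ドゥカティ', 'オートバイメーカー'),
    ('スズキ_(企業)', 'オートバイメーカー'),
    ('スズキ_(企業)', '自動車メーカー'),
]

graph = nx.Graph()
graph.add_edges_from(article_category_pairs)

# Articles that share a category sit two hops apart in this graph.
print(nx.shortest_path_length(graph, 'ヤマハ発動機', 'スズキ_(企業)'))  # -> 2
```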
Unfortunately, Wikipedia itself is not very well suited for this purpose. The following points have been cited as reasons: [^3]

- the category system has multiple paths (it is not a clean hierarchy)
- templates are used inconsistently

Instead, we recommend using the data published by DBpedia or Wikidata. [^2] Although DBpedia and Wikidata are rule-based, they do perform data cleaning.
I made a Wikification package.
The package itself can be installed with `pip install word2vec-wikification-py`. (I have only tested it with Python 3.5.x.)
Since numpy and gensim are required for installation, we recommend using Anaconda3.
Then download the model file by running `sh download_model.sh` (the script is in the repository).
The package will not work without this model file.
From here, the steps depend on your use case.
First of all, you will need the following two things:

- an environment for morphological analysis (splitting text into morphemes)
- MySQL and the Wikipedia dump data
If you do not have an environment for morphological analysis, my article on the subject may be a useful reference.
As for setting up MySQL, it depends on your environment, so please do your best! (I'm leaving that part entirely up to you.)
For the Wikipedia dump data, see [this section of the README](https://github.com/Kensuke-Mitsuzawa/word2vec_wikification_py#to-those-who-uses-interfacepredict_japanese_wiki_names).
Use the function `word2vec_wikification_py.interface.predict_japanese_wiki_names_with_wikidump()`.
The return value is a list of `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore` objects, holding plausible sequences of Wikipedia article names in descending order of score.
If you simply want the sequence of article names, you can get it with `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore.get_tokens()`.
See example for more information.
Perhaps some of you have already finished listing wikipedia article candidates by some means.
In such a case, follow the steps below to perform WSD, using these two components:

- `word2vec_wikification_py.word2vec_wikification_py.models.WikipediaArticleObject`
- `word2vec_wikification_py.interface.compute_wiki_node_probability()`
First, create a list of candidates.
For `Yamaha`, for example, enter the surface word together with its candidate Wikipedia article names (`ヤマハ` and `ヤマハ発動機`, i.e. Yamaha and Yamaha Motor), as in the code below.
At this point, __don't forget to enclose each article name in `[]`__.
If you do not enclose them in `[]`, the accuracy drops significantly.
# Import paths follow the module names referenced in this article.
from word2vec_wikification_py.word2vec_wikification_py.models import WikipediaArticleObject
from word2vec_wikification_py import interface, load_entity_model

# Candidate Japanese Wikipedia article names for each word, wrapped in [].
seq_wikipedia_article_object = [
    WikipediaArticleObject(page_title='ヤマハ', candidate_article_name=['[ヤマハ]', '[ヤマハ発動機]']),
    WikipediaArticleObject(page_title='スズキ', candidate_article_name=['[スズキ_(企業)]', '[スズキ_(魚)]']),
    WikipediaArticleObject(page_title='ドゥカティ', candidate_article_name=['[ドゥカティ]'])
]
Loading the model is a one-liner. Use `entity_vector.model.bin` as the model file.

path_model_file = './entity_vector.model.bin'  # downloaded by download_model.sh; adjust the path to your environment
model_object = load_entity_model.load_entity_model(path_entity_model=path_model_file, is_use_cache=True)
Then call `word2vec_wikification_py.interface.compute_wiki_node_probability()`, passing the candidate information and the model object.
sequence_score_objects = interface.compute_wiki_node_probability(
    seq_wiki_article_name=seq_wikipedia_article_object,
    entity_vector_model=model_object,
    is_use_cache=True
)
As in the first use case, the return value is a list of `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore` objects, holding plausible sequences of Wikipedia article names in descending order of score. If you just want the sequence of article names, call `SequenceScore.get_tokens()`.
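Putting it together, reading the result looks something like this (the printed sequence is only an illustration of what to expect):

```python
# The list is sorted by score, most plausible article-name sequence first.
best_sequence = sequence_score_objects[0]
print(best_sequence.get_tokens())
# e.g. ['[ヤマハ発動機]', '[スズキ_(企業)]', '[ドゥカティ]']
```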
First of all, as a major premise, this package is built on top of the Japanese Wikipedia entity vector. This is a research result by Mr. Suzuki of the Inui Laboratory at Tohoku University. I take my hat off to Mr. Suzuki for publishing such a wonderful resource.
To briefly explain the role of the Japanese Wikipedia entity vector: it is, in short, a __word2vec model over Wikipedia articles__. It therefore makes it possible to compute the similarity (distance) between Wikipedia articles.
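As a rough illustration of what that means (this uses plain gensim rather than this package's API, assumes the file is in standard word2vec binary format, and whether a given article token exists in the vocabulary depends on the model version):

```python
# The entity vector file is a word2vec-format model whose vocabulary contains
# Wikipedia article names wrapped in square brackets.
from gensim.models import KeyedVectors

entity_model = KeyedVectors.load_word2vec_format('entity_vector.model.bin', binary=True)

# Similarity between two article vectors (article names as in the examples above;
# they may or may not be present in your model version).
print(entity_model.similarity('[ヤマハ発動機]', '[スズキ_(企業)]'))

# Articles closest to a given article.
print(entity_model.most_similar('[ドゥカティ]', topn=5))
```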
So the next question is, "what did you actually do?" Roughly speaking, this package takes candidate Wikipedia article names for the words in the text, builds a graph (lattice) of those candidates, and then scores paths through the graph with the entity vector model to pick the most plausible sequence of article names.
If you know how MeCab works internally, you can picture the second step (building the lattice). There is a good article introducing MeCab's inner workings on Cookpad's technical blog, so I will point you to it; the structure called a "lattice" in that article is the graph we are building here.
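To make the graph-and-best-path idea concrete, here is a toy sketch. It is not the package's actual code: the vectors are made up, and the real package scores paths with the Japanese Wikipedia entity vector model, but the search over the lattice works along these lines.

```python
# Toy sketch of wikification as a best-path search over a lattice of candidates.
import numpy as np

# Made-up vectors standing in for the Japanese Wikipedia entity vectors.
toy_vectors = {
    '[ヤマハ]': np.array([0.9, 0.1]),
    '[ヤマハ発動機]': np.array([0.2, 0.9]),
    '[スズキ_(企業)]': np.array([0.3, 0.8]),
    '[スズキ_(魚)]': np.array([0.9, 0.2]),
    '[ドゥカティ]': np.array([0.1, 0.9]),
}

def similarity(a, b):
    """Cosine similarity between two candidate articles."""
    va, vb = toy_vectors[a], toy_vectors[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# The lattice: one slot per word in the sentence, each slot holding its candidates.
lattice = [
    ['[ヤマハ]', '[ヤマハ発動機]'],
    ['[スズキ_(企業)]', '[スズキ_(魚)]'],
    ['[ドゥカティ]'],
]

# Viterbi-style search: for each candidate in the current slot, keep the
# best-scoring path that reaches it from the previous slot.
best = {cand: (0.0, [cand]) for cand in lattice[0]}
for slot in lattice[1:]:
    new_best = {}
    for cand in slot:
        score, path = max(
            ((prev_score + similarity(prev, cand), prev_path + [cand])
             for prev, (prev_score, prev_path) in best.items()),
            key=lambda item: item[0],
        )
        new_best[cand] = (score, path)
    best = new_best

# The highest-scoring path picks the motorcycle-related senses.
print(max(best.values(), key=lambda item: item[0]))
```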
By the way, the set of usable Wikipedia articles depends on the vocabulary of the Japanese Wikipedia entity vector model.
Therefore, new articles that appeared after the latter half of 2016 are not subject to wikification.
I hope the Japanese Wikipedia entity vector will also be updated someday (|ω・`)チラ
- I made a Wikification package.
- With Wikification, you can extract keywords and disambiguate word senses.
- The Inui Laboratory at Tohoku University is amazing!
- For the sake of Wikification technology, please donate to Wikipedia and write Wikipedia articles.
- For bugs and other issues, please open a GitHub issue.
[^1]: Tsubomi is cute, isn't she (*´ω`*)
[^2]: As for Wikidata, I will write an article on how to use it at some point.
[^3]: If I put it this way on Wikipedia itself, I would be told, "This risks being original research; please cite a source."