[PYTHON] Let's try Wikification ~ Information extraction using Wikipedia & ambiguity resolution ~

What to introduce in this article

- I made a package for Wikification, so I'm introducing it
- What is Wikification?
- What are the advantages of Wikification?

What is Wikification in the first place?

In a nutshell, it's the task of associating words in the text with Wikipedia articles.

As an example, let's say you have a sentence like this:

`As of 2016, there are five participating manufacturers, Yamaha, Honda, Suzuki, Ducati, and Aprilia, plus satellite teams that can receive loans of works machines.`

This is a passage from the [Road Racing World Championship](https://ja.wikipedia.org/wiki/%E3%83%AD%E3%83%BC%E3%83%89%E3%83%AC%E3%83%BC%E3%82%B9%E4%B8%96%E7%95%8C%E9%81%B8%E6%89%8B%E6%A8%A9) article.

Here there is the work of linking the "Suzuki" that appears in the text to [Suzuki (company)](https://ja.wikipedia.org/wiki/%E3%82%B9%E3%82%BA%E3%82%AD_%28%E4%BC%81%E6%A5%AD%29). This process is called Wikification.

What are the benefits of Wikification?

There is a good article that summarizes the benefits better than this poor article of mine, so please have a look at that as well.

Extraction of important keywords

It is no exaggeration to say that Wikipedia article titles are, by and large, keywords.

In the example text above, "2016", "Yamaha", "Honda", "Suzuki", "Ducati", "Aprilia", "manufacturer", "works machine", "loan", and so on can all be associated with Wikipedia articles.

You may wonder whether "2016" or "loan" really count as keywords, but as long as a Wikipedia article exists for a term, it is treated as a keyword.

If you first pick up every Wikipedia-linked word, garbage included, and then filter the list sensibly, this becomes quite useful for keyword extraction.
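As a toy illustration of that filtering step, here is a minimal sketch; the stop list and the helper function are made up for this example and are not part of any package:

```python
import re

# Hypothetical stop list of Wikipedia-linked terms that are too generic to be
# useful keywords for a given application. Purely illustrative.
STOP_ARTICLES = {'Loan'}

def filter_keywords(linked_terms):
    """Drop bare numbers (years, etc.) and stop-listed terms from the Wikipedia-linked words."""
    return [term for term in linked_terms
            if not re.fullmatch(r'\d+', term) and term not in STOP_ARTICLES]

print(filter_keywords(['2016', 'Yamaha', 'Honda', 'Suzuki', 'Ducati', 'Aprilia', 'Loan']))
# -> ['Yamaha', 'Honda', 'Suzuki', 'Ducati', 'Aprilia']
```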

Disambiguation of word meaning

Wikification also performs word sense disambiguation (WSD) along the way, because in the real world it is common for a single word to have multiple meanings.

For example, suppose you have the text `Today I went to TSUTAYA and borrowed a new DVD of Tsubomi (つぼみ, "bud")`. [^1]

With a simple text match, "Tsubomi" is ambiguous: it could be [Tsubomi (AV actress)](https://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%BC%E3%81%BF_%28AV%E5%A5%B3%E5%84%AA%29), a flower bud, [Tsubomi (Kobukuro song)](https://ja.wikipedia.org/wiki/%E8%95%BE_%28%E3%82%B3%E3%83%96%E3%82%AF%E3%83%AD%E3%81%AE%E6%9B%B2%29), or any of several other possibilities.

Of course, we humans can infer from the context that "they went to TSUTAYA and borrowed a DVD, so it's probably Tsubomi (the AV actress)!" That is an impressive human ability. Selecting the correct word sense from context in this way is called WSD, and it is a research field of natural language processing.

Wikification also carries out WSD as part of the work of associating words with Wikipedia articles.

Features for machine learning

One of the advantages of Wikipedia is that it is __structured data__. Specifically, the category system and article templates play that role.

You can build network graphs from the category structure and article templates, which means they can serve as useful features for document classification and clustering tasks.
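As a minimal sketch of that idea (the article-category pairs below are invented for illustration; in practice you would extract them from the Wikipedia, DBpedia, or Wikidata dumps), you can build such a graph with networkx:

```python
import networkx as nx

# Illustrative article -> category pairs; real pairs would come from the dumps.
article_categories = [
    ('Suzuki (company)', 'Motorcycle manufacturers of Japan'),
    ('Yamaha Motor', 'Motorcycle manufacturers of Japan'),
    ('Ducati', 'Motorcycle manufacturers of Italy'),
]

category_graph = nx.Graph()
for article, category in article_categories:
    category_graph.add_edge(article, category)

# Articles that share a category are two hops apart; such graph distances can be
# fed into document classification or clustering as features.
print(nx.shortest_path_length(category_graph, 'Suzuki (company)', 'Yamaha Motor'))  # -> 2
```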

Unfortunately, Wikipedia itself is not very well suited for this purpose. The following points have been identified as the causes. [^3]

- The category system has multiple paths
- Templates are used inconsistently

Instead, we recommend using the data published by DBpedia or Wikidata. [^2] Although DBpedia and Wikidata are rule-based, they do perform data cleaning.

So how do you use it?

I made a Wikification package.

The package itself is installed with `pip install word2vec-wikification-py`. (I have only tested it with Python 3.5.x.) Since numpy and gensim are required for installation, we recommend using Anaconda3.

Then run the download script with `sh download_model.sh`. The package will not work without this model file.

From here, what you do branches off depending on the use case.

I want to Wikify from plain text, or from already morpheme-segmented text

First of all, you will need the following two items.

- An environment for morphological analysis (morpheme segmentation)
- MySQL and the Wikipedia dump data

Morphological analysis environment

If you do not have an environment for morphological analysis, please refer to my article on the subject.

Preparing MySQL and the Wikipedia dump data

As for setting up MySQL, it depends on your environment, so please do your best on your own! (I'm leaving that entirely to you.)

For the Wikipedia dump data, see [this section of the README](https://github.com/Kensuke-Mitsuzawa/word2vec_wikification_py#to-those-who-uses-interfacepredict_japanese_wiki_names).

Implementing Wikification

Use the function `word2vec_wikification_py.interface.predict_japanese_wiki_names_with_wikidump()`.

The return value is a list of `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore` objects, which hold the "plausible Wikipedia article name sequences" in descending order of score.

If all you want is the word sequence, you can get it with `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore.get_tokens()`.

See the example for more information.

I already have a list of Wikipedia article name candidates and want to do WSD

Perhaps some of you have already finished listing wikipedia article candidates by some means.

In such a case, follow the steps below to perform WSD.

  1. Generate `word2vec_wikification_py.word2vec_wikification_py.models.WikipediaArticleObject` objects
  2. Load the model file
  3. Call `word2vec_wikification_py.interface.compute_wiki_node_probability()`

Creation of candidate information

First, create a list of candidates. For "Yamaha", as in the code below, enter candidate Wikipedia article names such as "Yamaha" or "Yamaha Motor". __Don't forget to enclose each article name in [].__ If you do not enclose it in [], accuracy drops significantly.


```python
# Assumption: WikipediaArticleObject is importable from the package's models module.
from word2vec_wikification_py.models import WikipediaArticleObject

# candidate_article_name holds the candidate Wikipedia article names, each enclosed in [].
seq_wikipedia_article_object = [
    WikipediaArticleObject(page_title='Yamaha', candidate_article_name=['[Yamaha]', '[Yamaha発動機]']),
    WikipediaArticleObject(page_title='Suzuki', candidate_article_name=['[Suzuki_(Company)]', '[Suzuki_(fish)]']),
    WikipediaArticleObject(page_title='Ducati', candidate_article_name=['[Ducati]'])
]
```

Load model file

It's done in one line. Use `entity_vector.model.bin` (the file downloaded earlier) as the model file, and pass its path as `path_model_file`.


```python
# Assumes: from word2vec_wikification_py import load_entity_model
model_object = load_entity_model.load_entity_model(path_entity_model=path_model_file, is_use_cache=True)
```

Interface call

Call `word2vec_wikification_py.interface.compute_wiki_node_probability()`, passing it the candidate information and the model object.


```python
from word2vec_wikification_py import interface

sequence_score_objects = interface.compute_wiki_node_probability(
    seq_wiki_article_name=seq_wikipedia_article_object,
    entity_vector_model=model_object,
    is_use_cache=True
)
```

As before, the return value is a list of `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore` objects, holding the "plausible Wikipedia article name sequences" in descending order of score.

If all you want is the word sequence, you can get it with `word2vec_wikification_py.word2vec_wikification_py.models.SequenceScore.get_tokens()`.
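For instance, continuing from the snippet above, the top-ranked result can be inspected like this (the printed article names are only meant to illustrate the output format):

```python
# The list is returned in descending order of score, so the first element is
# the most plausible Wikipedia article name sequence.
best_sequence = sequence_score_objects[0]
print(best_sequence.get_tokens())
# e.g. ['[Yamaha発動機]', '[Suzuki_(Company)]', '[Ducati]']
```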

How does it work?

First of all, as a major premise, this package is built on the Japanese Wikipedia entity vectors. These are the research results of Mr. Suzuki from the Inui Laboratory at Tohoku University. I am deeply impressed that Mr. Suzuki publishes such wonderful research results.

To briefly explain the role of the Japanese Wikipedia entity vectors: they are __a "word2vec model over Wikipedia articles"__. This makes it possible to compute similarity distances between Wikipedia articles.
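As a minimal sketch, and assuming that `entity_vector.model.bin` is an ordinary word2vec-format binary file that gensim can read (and that the bracketed article names below exist in its vocabulary), you can compute article-to-article similarity directly:

```python
from gensim.models import KeyedVectors

# Assumption: the entity vector file is a standard word2vec binary model.
entity_vectors = KeyedVectors.load_word2vec_format('entity_vector.model.bin', binary=True)

# Article entries are stored with their titles enclosed in [] (see the candidate
# notation above); these two titles are only examples.
print(entity_vectors.similarity('[ヤマハ発動機]', '[スズキ_(企業)]'))
```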

So, what does this package actually do? Here is the processing flow:

  1. Create Wikipedia article candidates from the input words
  2. Build a graph from combinations of the article candidates
  3. Select the optimal path through the graph (the Japanese Wikipedia entity vectors are used to compute the optimal path)

If you know how MeCab works internally, you can picture the second step. There is a good article introducing MeCab's inner workings on Cookpad's engineering blog, so I will link it here. The structure called a "lattice" in that article is the graph we are building here.
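To make step 3 concrete, here is a brute-force sketch of the idea (not the package's actual implementation): score every path through the candidate lattice by the similarity of consecutive article vectors and keep the best-scoring one.

```python
import itertools
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_article_path(candidates_per_word, article_vectors):
    """Exhaustive lattice search.

    candidates_per_word: one list of candidate article names per input word.
    article_vectors: mapping from article name to its entity vector (numpy array).
    Returns the candidate sequence whose consecutive articles are most similar.
    """
    best_path, best_score = None, float('-inf')
    for path in itertools.product(*candidates_per_word):
        score = sum(cosine(article_vectors[a], article_vectors[b])
                    for a, b in zip(path, path[1:]))
        if score > best_score:
            best_path, best_score = list(path), score
    return best_path, best_score
```

A real implementation would use dynamic programming over the lattice (the Viterbi-style search that MeCab also uses) rather than enumerating every path, but the scoring idea is the same.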

Incidentally, the Wikipedia article information depends on the vocabulary of the Japanese Wikipedia entity vector model. Therefore, new articles that appeared after the second half of 2016 are not covered by Wikification. I hope the Japanese Wikipedia entity vectors will be updated as well (|ω・`)chira

Summary

- I made a Wikification package.
- With Wikification, you can extract keywords and disambiguate word senses.
- The Inui Laboratory at Tohoku University is amazing!
- Please donate to Wikipedia and write articles, for the sake of Wikification technology.
- For bugs and other issues, please use the GitHub Issues.


[^1]: Tsubomi is cute, isn't she (* ´ω ` *)
[^2]: As for Wikidata, I will write an article on how to use it at some point.
[^3]: Putting it this way would, in the Wikipedia world, get flagged as "possibly original research; please cite a source."
