[PYTHON] For those who want to perform natural language processing using Wikipedia's knowledge that goes beyond simple keyword matching

Value of using knowledge beyond keyword matching

In human conversation, we can rely on the knowledge that "Twitter" and "Facebook" are social networking services, and when someone says "Yamaha" we can tell from context whether it means the motorcycle maker or the piano maker. This is possible because we have background knowledge about the words. Entity linking is a method for connecting words with such knowledge, and it has been used frequently in natural language processing in recent years.


If you want to try it right away, the code is available in the GitHub repository linked later in this article (https://github.com/SnowMasaya/WikiPedia_Entity_Vector_Get_Similarity_word).

Required data: the text data you want to analyze.

Use Case

The following use cases can be considered when actually using this.

1: Suggestion. If related words are suggested when a user searches for a keyword, searching becomes easier and more useful for the user.


2: Dialogue interface. Utterances in a dialogue are short, so they carry little information. To give sophisticated answers from this small amount of information, it is essential to link not only the words themselves but also the related knowledge.


3: Information extraction from Twitter. Tweets carry little information, so simple keywords alone make it hard to extract anything useful. By associating keywords with related knowledge, it becomes possible to obtain useful information that keyword matching alone could not.


What is Entity Linking?

It is a technique that has attracted a lot of attention at ACL, one of the top conferences in natural language processing.

The way keywords are connected to knowledge varies: the link may contain detailed information, or summary information with supplementary details. Either way, there are two important points.

1: Extract only the keywords considered important from the text
2: Connect the keywords with related information

Extract only the keywords that you think are important from the text

Words that are generally considered important can be extracted by keyword matching with Wikification; if you just want something simple, you can try extracting only proper nouns with MeCab (a small sketch follows below). Strictly speaking, a machine learning model is needed to judge whether a given keyword is actually useful, but that is not covered in this article. If you want more details, please see the following material.

Entity linking utilizing knowledge base
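As a simple illustration of the MeCab-only approach, here is a minimal sketch (not the article's original code; it assumes mecab-python3 and a Japanese dictionary such as ipadic are installed):

# Minimal sketch: extract proper nouns with MeCab.
# Assumes mecab-python3 and a Japanese dictionary (e.g. ipadic) are installed.
import MeCab

def extract_proper_nouns(text):
    tagger = MeCab.Tagger()
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        features = node.feature.split(",")
        # features[0] is the coarse part of speech, features[1] the sub-category
        if features[0] == "名詞" and features[1] == "固有名詞":
            keywords.append(node.surface)
        node = node.next
    return keywords

print(extract_proper_nouns("ヤマハのバイクで秋田へ行った"))  # e.g. ['ヤマハ', '秋田']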

Connect keywords and related information

The other step is to match the keywords against Wikipedia or DBpedia and use the information linked to the matched entries.

If the matched entities are represented in a vector space, various operations can be performed by computation, which broadens the range of uses. This is what the Japanese Wikipedia entity vector introduced here makes possible.

It is an extension of Word2Vec, so if you understand Word2Vec you can apply it; if you want the details, I recommend reading the paper. Even if you do not read it, pre-computed vectors are already provided, so if you are short on time you can simply use them.

Multiple named entity labels for Wikipedia articles

How to create the Japanese Wikipedia entity vector

It's very simple and easy to understand.

1: Segment the Wikipedia text into words with MeCab or a similar tool
2: Replace words that carry a hyperlink in Wikipedia with the title of the link destination
3: When a word that had a hyperlink appears again later without one, apply the same replacement as in step 2
4: Train Word2Vec on the resulting word sequences

This makes it possible to extract named entities as words and associate them with real-world entities (entity linking).
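To make the procedure concrete, here is a minimal sketch of steps 2 and 4 (not the authors' actual preprocessing code; the link markup, toy corpus, and gensim 4.x parameters below are assumptions for illustration):

# Minimal sketch of steps 2 and 4, not the authors' actual code.
# Assumes hyperlinks appear as "[[target|anchor]]", a simplification of real Wikipedia markup.
import re
from gensim.models import Word2Vec

def replace_links(sentence):
    # Replace every "[[target|anchor]]" with the title of the link destination.
    return re.sub(r"\[\[([^|\]]+)\|[^\]]*\]\]", r"\1", sentence)

# Toy corpus standing in for word-segmented Wikipedia sentences.
raw_sentences = [
    "[[ヤマハ発動機|ヤマハ]] は バイク を 製造 する",
    "[[ゴジラ_(架空の怪獣)|ゴジラ]] は 怪獣 で ある",
]
sentences = [replace_links(s).split() for s in raw_sentences]

# Step 4: train Word2Vec on the processed word sequences (gensim 4.x API).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)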

Implementation example using the Japanese Wikipedia entity vector

(Figure: overview of the example implementation)

The figure above shows an example system implementation: it retrieves related words for a given word using the Wikipedia entity vector. The code is posted on GitHub, so only the important parts are quoted here.

1: Collect Twitter tweets (this time using Rinna's data)
2: Segment them into words
3: Extract named entities with Wikification
4: Attach vectors with the Japanese Wikipedia entity vector
5: Compute the cosine similarity and output entries with high similarity as synonyms

The code is simple, so see the repository below.

https://github.com/SnowMasaya/WikiPedia_Entity_Vector_Get_Similarity_word
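For reference, a minimal sketch of steps 4 and 5 might look like the following (an illustration only, not code quoted from the repository; it assumes the entity vector file is in word2vec text format, and the file name and threshold are hypothetical):

# Sketch of steps 4 and 5 (assumed, not taken from the repository above).
from gensim.models import KeyedVectors

# The Japanese Wikipedia entity vector is assumed to be in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("entity_vector.model.txt", binary=False)

def synonyms(named_entity, topn=10, threshold=0.5):
    # Return words whose cosine similarity to the entity exceeds the threshold.
    if named_entity not in vectors:
        return []
    return [word for word, score in vectors.most_similar(named_entity, topn=topn)
            if score >= threshold]

print(synonyms("秋田"))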

Tricks for speeding up

Trick 1: Use OpenBLAS directly

The cosine similarity calculation is computationally expensive, so I try to speed up this part.

Using OpenBLAS for this speedup is a little tricky, so I will add some explanation. On a Mac, install it with the following command.

brew install openblas 

Next, specify the directory where the OpenBLAS library is located (for example, in numpy's site.cfg when building numpy against OpenBLAS).

[openblas]
libraries = openblas
library_dirs = /usr/local/opt/openblas/lib
include_dirs = /usr/local/opt/openblas/include
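As an extra check that is not in the original article, you can confirm whether numpy actually picked up OpenBLAS by printing its build configuration:

# Extra check (not in the original article): confirm numpy is linked against OpenBLAS.
import numpy as np

np.show_config()  # the BLAS section should mention openblas and the library_dirs above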

Because this part is computed via Cython, the effect is uncertain, but first check whether the memory layout of the vectors used for the cosine similarity is C order, using the following code.

The reason for this check is that BLAS copies the data when a vector is in C order; if the array is in Fortran order (or handled via its transpose), that unnecessary copy is skipped, which speeds things up.

    def __force_forder(self, x):
        """
        Convert array x to Fortran order.
        Returns a tuple (x, is_transposed).
        :param x(vector):
        :return:
        """
        if x.flags.c_contiguous:
            # Transposing a C-ordered array gives a Fortran-ordered view without copying.
            return (x.T, True)
        else:
            return (x, False)

Next, the dot product of the vectors is computed with the following code. After checking the memory layout, a C-ordered array is handled through its transpose flag rather than an explicit conversion; if the array is already in Fortran order, no conversion is needed and the computation runs fast.

    def __faster_dot(self, A, B):
        """
        Use the BLAS library directly to perform the dot product.
        FB is scipy.linalg.blas, imported for example as "from scipy.linalg import blas as FB".
        Reference:
            https://www.huyng.com/posts/faster-numpy-dot-product
            http://stackoverflow.com/questions/9478791/is-there-an-enhanced-numpy-scipy-dot-method
        :param A(mat): vector
        :param B(mat): vector
        :return:
        """
        A, trans_a = self.__force_forder(A)
        B, trans_b = self.__force_forder(B)

        return FB.dgemm(alpha=1.0, a=A, b=B, trans_a=trans_a, trans_b=trans_b)
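For context, a cosine similarity built on this kind of BLAS dot product could look like the following (a sketch based on the standard formula, not code quoted from the repository):

# Sketch (not from the repository): cosine similarity via a BLAS dot product.
import numpy as np
from scipy.linalg import blas as FB

def cosine_similarity(a, b):
    # dgemm works on 2-D arrays, so reshape the vectors into a row and a column matrix.
    dot = FB.dgemm(alpha=1.0, a=a.reshape(1, -1), b=b.reshape(-1, 1))[0, 0]
    return dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # about 0.707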

Trick 2: Thread-based parallel and distributed processing

The cosine similarity calculation is one bottleneck, but because the number of words registered in the Wikipedia entity vector is large, repeating the same processing many times also takes a lot of time.

Since Python basically runs in a single process, I tried to speed it up by implementing thread-based parallel processing.

I use the producer-consumer pattern with a Queue. In this case the consumer's processing is heavy, so I speed things up by increasing the number of consumer threads: a consumer size is set, and that many threads are created and started, as in the snippet below.

for index in range(args.consumer_size):
    multi_thread_consumer_crawl_instance = threading.Thread(
        target=producerConsumer.consumer_run,
        name=consumer_name + str(index))
    multi_thread_consumer_crawl_instance.start()
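For readers unfamiliar with the pattern, here is a self-contained, simplified sketch (the names and the work done inside the consumer are hypothetical, not the repository's code):

# Self-contained sketch of the producer-consumer pattern (names and work are hypothetical).
import queue
import threading

task_queue = queue.Queue()

def producer(named_entities):
    # Put every named entity into the queue.
    for entity in named_entities:
        task_queue.put(entity)

def consumer():
    while True:
        entity = task_queue.get()
        if entity is None:          # sentinel value: no more work
            task_queue.task_done()
            break
        # Heavy work goes here, e.g. computing cosine similarities for the entity.
        task_queue.task_done()

consumer_size = 4
threads = [threading.Thread(target=consumer, name="consumer" + str(i))
           for i in range(consumer_size)]
for t in threads:
    t.start()

producer(["秋田", "ゴジラ", "中国"])
for _ in threads:
    task_queue.put(None)            # one stop signal per consumer thread
for t in threads:
    t.join()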

Example results

The format is "original named entity: [computed synonyms]".

Looking at the results below, you can see that words that are highly related, yet hard to capture with simple keyword matching, are retrieved.

'Akita': ['Nagano', 'Fukushima', 'Kochi', 'Iwate', 'Yamagata', 'Niigata', 'Aomori', 'Kumamoto', 'Morioka'], 

'hundred': ['hundred', 'Ten', 'thousand'],

'Godzilla': ['Godzilla_(1954 movie)', 'Godzilla_(Fictitious monster)', 'Gamera'], 

'3': ['4', '6', '5', '0', '7', '8', '9', '2', '1'],

'Red': ['purple', 'green', 'Green', 'vermilion', 'black', 'Red色', 'Blue', 'White', 'yellow', 'Indigo', 'Blue']

'Pig': ['Cow', 'sheep', 'Sheep', 'Chicken', 'Goat', 'chicken', '山sheep', 'pig', 'cow'], 

'golf': ['Bowling'], 

'bamboo': ['willow', 'Pine']

'5': ['4', '6', '0', '7', '3', '8', '9', '2', '1'], 

'branch': ['Stem', 'leaf', 'branchは'],

'wood': ['Cedar', 'Oak', 'stump', '松のwood'],

'Hmm': ['Pen', 'Gyu'], 

'student': ['student', '大student'],

'Mochi': ['Manju', 'sake bottle', 'Red rice', 'egg', 'Miki', 'Porridge', 'Azuki', 'dumpling'],

'Waist': ['buttocks', 'knee', 'heel', 'shoulder'], 

'beard': ['口beard', 'Beard', '口Beard', 'beard', 'beard', 'hair', 'あごBeard'], 

'Cat': ['Little bird', 'cat', '仔Cat', 'mouse or rat'], 

'China': ['Taiwan', 'Korea', 'Korea', 'People's Republic of China'],

'two': ['Five', 'Two', 'Two', 'three'], 

'yukata': ['yukata', 'Everyday wear', 'Pure white', 'Mourning clothes', 'kimono', 'tuxedo', 'Everyday wear', 'Pure white', 'Mourning clothes', 'kimono', 'tuxedo'], 

'baseball': ['rugby'],

'hair': ['頭hair', '黒hair', '長hair', 'beard', 'hairの毛', '前hair', '金hair', 'hair型'],

'autumn': ['autumn', 'summer', 'spring', 'summer', 'spring'],

'Nara': ['Wakayama']

Important points

Named entities are obtained via Wikification, so they depend on Wikipedia, and the knowledge space of the data also relies on Wikipedia; it is better not to use this approach in specialized domains or ones with many rare entities. In the Japanese Wikipedia entity vector, hyperlinked words are represented as "<< word >>", so you need processing to strip the "<< >>" markers. It consumes a lot of memory, and the computation time is very long: with 192 original named entities it takes about 3 hours in a single process and single thread, but since the same processing is repeated for each named entity, parallel and distributed processing makes it faster.
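Stripping those markers can be done with a small helper such as the following (a sketch, not code from the repository):

# Sketch (not from the repository): strip the "<< >>" markers around hyperlinked words.
import re

def strip_entity_markers(word):
    return re.sub(r"^<<\s*(.*?)\s*>>$", r"\1", word)

print(strip_entity_markers("<<ゴジラ>>"))  # -> ゴジラ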

References

Entity linking utilizing knowledge base

Yamada, Ikuya, Hideaki Takeda, and Yoshiyasu Takefuji. "Enhancing Named Entity Recognition in Twitter Messages Using Entity Linking." ACL-IJCNLP 2015 (2015): 136.

Faster numpy dot product for multi-dimensional arrays

scipy.linalg.blas.ddot

numpy.ndarray.flags

Is there an “enhanced” numpy/scipy dot method?

models.word2vec – Deep learning with word2vec
