[PYTHON] Why is distributed representation of words important for natural language processing?


If you are reading this article, you may already be familiar with **Word2vec** (Mikolov et al., 2013). Word2vec makes it possible to perform arithmetic on words as if you were manipulating their meanings. A famous example is that subtracting Man from King and adding Woman yields Queen (King - Man + Woman = Queen).

(Image from https://www.tensorflow.org/get_started/embedding_viz)

In fact, inside Word2vec, words are represented by vectors of roughly 200 dimensions called **distributed representations** (also known as embeddings), and it is these vectors that are added and subtracted. The characteristics of each word are thought to be encoded in this vector, which is why vector arithmetic can produce meaningful results.

**Distributed representations of words are an important technique commonly used in natural language processing today.** Recently, a huge number of neural network (NN)-based models have been proposed in natural language processing research, and these NN-based models often take distributed representations of words as input.

In this article, I will explain **why distributed representations of words are important for natural language processing**. I will start with a brief explanation of distributed representations so that we share a common understanding. Next, I will move to the main theme and explain why distributed representations are important for natural language processing. Finally, I will describe some remaining challenges of distributed representations.

What is a distributed representation of words?

Here, I will give a brief explanation of the **distributed representation** of words. For comparison, I will also cover the **one-hot representation** of words to make the benefits of distributed representations clear. I will first explain the one-hot representation and its problems, and then move on to distributed representations.

One-hot representation

The simplest way to represent a word as a vector is the one-hot representation. In a one-hot representation, exactly one element of the vector is 1 and all other elements are 0. Each dimension corresponds to one word, and the position of the 1 indicates which word the vector represents.

For example, let's represent the word python as a one-hot vector. Suppose the vocabulary, the set of words we deal with, consists of five words (nlp, python, word, ruby, one-hot). The vector representing python then has a 1 in the position assigned to python and 0 everywhere else, as in the sketch below.
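A minimal sketch of this encoding, assuming the five-word vocabulary above and an arbitrary assignment of indices to words:

```python
import numpy as np

# Hypothetical five-word vocabulary; the index given to each word is arbitrary.
vocab = ["nlp", "python", "word", "ruby", "one-hot"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("python"))  # [0. 1. 0. 0. 0.]
```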

The one-hot representation is simple, but it has the disadvantage that operations between vectors do not produce meaningful results. For example, suppose you take the dot product to compute the similarity between two words. Because different words have their 1 in different positions and 0 everywhere else, the dot product between any two different words is always 0, which is not the result we want. In addition, since each word gets its own dimension, the vectors become very high-dimensional as the vocabulary grows.
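Continuing the sketch above, the dot product between one-hot vectors for different words is always 0, so it carries no information about similarity:

```python
# Dot products between one-hot vectors carry no similarity information.
print(np.dot(one_hot("python"), one_hot("ruby")))    # 0.0
print(np.dot(one_hot("python"), one_hot("word")))    # 0.0
print(np.dot(one_hot("python"), one_hot("python")))  # 1.0 (only identical words match)
```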

Distributed representation

Distributed representations, on the other hand, represent words as low-dimensional real-valued vectors, typically with around 50 to 300 dimensions. For example, the words above could be represented by dense vectors like the ones sketched below.

Distributed representations solve the problems of one-hot representations. For example, you can compute the similarity between words through operations on their vectors; with vectors like these, the similarity between python and ruby comes out higher than the similarity between python and word. Also, the number of dimensions does not have to grow as the vocabulary grows.
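A sketch with made-up (hypothetical) values, just to show how similarity becomes computable once words are dense vectors; real embeddings are learned from data and have 50 to 300 dimensions:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-written for illustration only.
embeddings = {
    "python": np.array([0.8, 0.1, 0.7, 0.2]),
    "ruby":   np.array([0.7, 0.2, 0.6, 0.1]),
    "word":   np.array([0.1, 0.9, 0.0, 0.8]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["python"], embeddings["ruby"]))  # about 0.99 (high)
print(cosine_similarity(embeddings["python"], embeddings["word"]))  # about 0.25 (low)
```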

Why are distributed representations of words important?

This section explains why distributed representations of words matter in natural language processing. I will first talk about the input to natural language processing tasks, then about using distributed representations as that input, and finally about how distributed representations affect task performance.

There are many different natural language processing tasks, and many of them take a sequence of words as input. For document classification, the input is the set of words contained in the document. For part-of-speech tagging, the input is a tokenized sequence of words, and the same is true for named entity recognition. The sketch below illustrates this.
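A toy illustration of these word-level inputs (the sentences and labels are hypothetical):

```python
# Document classification: the words of a document plus a document-level label.
doc_words = ["the", "movie", "was", "great"]          # label: positive

# Part-of-speech tagging: one tag per token.
pos_tokens = ["I", "love", "python"]                  # tags: PRON VERB PROPN

# Named entity recognition: one BIO tag per token.
ner_tokens = ["Barack", "Obama", "visited", "Tokyo"]  # tags: B-PER I-PER O B-LOC
```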

Modern natural language processing often relies on neural networks, and these models also usually take word sequences as input. The RNNs that have long been standard take words as input, and the CNN-based models that have attracted attention more recently are likewise usually fed input at the word level.

In fact, distributed representations are commonly used as the representation of the words given to these neural networks [^1]. The expectation is that an input representation that better captures the meaning of words will also improve task performance. It is also possible to use distributed representations learned from a large amount of unlabeled data as the initial values of the network and then fine-tune them with a small amount of labeled data, as in the sketch below.
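A minimal sketch of this initialization pattern, written with PyTorch but not specific to it; the pretrained matrix is random here purely as a stand-in for vectors loaded from word2vec or GloVe:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 200

# Stand-in for a pretrained embedding matrix (one row per vocabulary word);
# in practice this would be loaded from word2vec, GloVe, etc.
pretrained = torch.randn(vocab_size, embed_dim)

# Initialize the embedding layer with the pretrained vectors; freeze=False
# lets them be fine-tuned on the small labeled dataset.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

word_ids = torch.tensor([[12, 5, 97]])   # a sentence as word indices
word_vectors = embedding(word_ids)       # shape (1, 3, 200), fed to an RNN/CNN
print(word_vectors.shape)
```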

These distributed representations matter because they affect task performance. Using them has been reported to improve performance compared with not using them [2]. In short, distributed representations of words are important because they serve as the input to many tasks and have a considerable effect on their performance.

Challenges of distributed representations of words

That said, distributed representations of words are not a silver bullet for natural language processing. Many studies have shown that they have various problems. Here I will introduce two of them.

Problem 1: Performance on actual tasks does not improve as much as expected

The first issue is that good results on evaluation datasets do not necessarily translate into the performance gains you would expect on an actual task (such as document classification). Distributed representations of words are usually evaluated by how well the similarities they produce correlate with human-annotated word similarity datasets (Schnabel, Tobias, et al., 2015). The problem is that representations from a model that correlates well with human judgments do not necessarily improve performance when used in an actual task. The sketch below shows what this intrinsic evaluation looks like.
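A minimal sketch of this intrinsic evaluation, with a hypothetical three-pair evaluation set and the hand-written vectors from earlier; real evaluations use datasets such as WordSim-353 and vectors from a trained model:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical word vectors (in practice, loaded from a trained model).
embeddings = {
    "python": np.array([0.8, 0.1, 0.7, 0.2]),
    "ruby":   np.array([0.7, 0.2, 0.6, 0.1]),
    "word":   np.array([0.1, 0.9, 0.0, 0.8]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical evaluation set: (word1, word2, human similarity rating).
eval_pairs = [("python", "ruby", 8.5), ("python", "word", 2.0), ("ruby", "word", 1.5)]

human = [score for _, _, score in eval_pairs]
model = [cosine_similarity(embeddings[w1], embeddings[w2]) for w1, w2, _ in eval_pairs]

# The standard intrinsic evaluation: Spearman rank correlation with human ratings.
rho, _ = spearmanr(human, model)
print(rho)
```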

The reason is that most evaluation datasets do not distinguish between word similarity and word relatedness. For example, (male, man) are similar, while (computer, keyboard) are related but not similar. Datasets that do make this distinction have been reported to correlate positively with performance on actual tasks (Chiu, Billy, Anna Korhonen, and Sampo Pyysalo, 2016).

As a result, there are ongoing attempts to build evaluation datasets that do correlate with actual tasks (Oded Avraham, Yoav Goldberg, 2016). This work tries to address two problems with existing datasets: word similarity and relatedness are not distinguished, and annotation scores vary among annotators.

Besides building new evaluation datasets, there is also research on making it easier to evaluate distributed representations on actual tasks (Nayak, Neha, Gabor Angeli, and Christopher D. Manning, 2016). This should make it easy to check whether a learned distributed representation is effective for tasks close to the one you actually want to solve.

Personally, I hope that models that have so far gone unnoticed will be re-examined through evaluation on these new datasets and tasks.

Problem 2: Word ambiguity is not taken into account

The second issue is that current distributed representations do not take word ambiguity into account. Words can have multiple meanings: for example, the word "bank" can refer to a financial institution or to the bank of a river. A single vector per word cannot capture such ambiguity, so there is a limit to what it can express.

Several methods have been proposed to address this problem by learning a representation for each word sense [7][8][9][10]. SensEmbed, for example, first disambiguates word senses and then learns a representation for each sense. Learning sense-level representations has been reported to improve performance on word similarity evaluation.

For those who want to know more

The following repository collects information on distributed representations of words and sentences, pre-trained vectors, and Python implementations: awesome-embedding-models

A star would be much appreciated m(_ _)m

In conclusion

Distributed representations of words are an interesting and actively studied field. I hope this article helps you understand them.

The following Twitter account shares easy-to-understand information on the latest papers in **machine learning / natural language processing / computer vision**. It posts content that readers of this article should find interesting, so please consider following it: @arXivTimes

I also tweet about machine learning and natural language processing from my own account, so I'd love to hear from anyone interested in this field. @Hironsan

Footnotes

[^1]: Strictly speaking, words are first given as one-hot vectors (or word indices) and then converted into distributed representations, typically by an embedding layer, as in the sketch below.
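A tiny sketch of this conversion: multiplying a one-hot vector by an embedding matrix just selects the corresponding row, which is why frameworks implement the embedding layer as an index lookup (the matrix here is random for illustration):

```python
import numpy as np

vocab_size, embed_dim = 5, 4
rng = np.random.default_rng(0)

# Hypothetical embedding matrix: one row per vocabulary word.
E = rng.normal(size=(vocab_size, embed_dim))

one_hot = np.zeros(vocab_size)
one_hot[1] = 1.0                       # one-hot vector for word index 1

print(np.allclose(one_hot @ E, E[1]))  # True: the product is simply row 1 of E
```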

References

  1. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
  2. Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014.
  3. Schnabel, Tobias, et al. "Evaluation methods for unsupervised word embeddings." EMNLP. 2015.
  4. Chiu, Billy, Anna Korhonen, and Sampo Pyysalo. "Intrinsic evaluation of word vectors fails to predict extrinsic performance." ACL 2016 (2016): 1.
  5. Oded Avraham, Yoav Goldberg. "Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure." arXiv preprint arXiv:1611.03641 (2016).
  6. Nayak, Neha, Gabor Angeli, and Christopher D. Manning. "Evaluating Word Embeddings Using a Representative Suite of Practical Tasks." ACL 2016 (2016): 19.
  7. Trask, Andrew, Phil Michalak, and John Liu. "sense2vec - A fast and accurate method for word sense disambiguation in neural word embeddings." arXiv preprint arXiv:1511.06388 (2015).
  8. Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2015). SensEmbed: Learning Sense Embeddings for Word and Relational Similarity. In ACL (1) (pp. 95-105).
  9. Reisinger, Joseph, and Raymond J. Mooney. "Multi-prototype vector-space models of word meaning." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.
  10. Huang, Eric H., et al. "Improving word representations via global context and multiple word prototypes." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012.
