[PYTHON] Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" --Chapter 3 Step 11 Memo "Word Embeddings"

Contents

This is a personal memo written as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This time I note my own takeaways from Chapter 3, Step 11. Word embeddings are the part of natural language processing I am personally most interested in.

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both Client and Server

Chapter overview

Conventional feature extraction methods such as BoW (and its variants) represent text with a vector whose number of dimensions equals the vocabulary size. With word embeddings, a word can instead be represented by a vector with a fixed, much smaller number of dimensions (a **distributed representation of words**). This vector carries information that behaves as if it encoded the meaning of the word.

11.1 What are Word embeddings?

Comparison with one-hot representations such as BoW

| Item | One-hot representation | Word embeddings |
| --- | --- | --- |
| Number of vector dimensions | Vocabulary size; can range from tens of thousands to millions | Fixed value chosen by the designer; typically a few hundred |
| Vector values | 1 in certain dimensions, 0 in the others | All dimensions take real values |
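To make the contrast concrete, here is a small sketch of my own (not from the book); the vocabulary size and embedding dimension below are made-up values.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
vocab_size = 50000      # one-hot: one dimension per vocabulary entry
embedding_dim = 300     # word embeddings: fixed size chosen by the designer

# One-hot representation: a single dimension is 1, all others are 0.
one_hot = np.zeros(vocab_size)
one_hot[12345] = 1.0    # index assigned to some word in the vocabulary

# Distributed representation: every dimension takes a real value.
embedding = np.random.randn(embedding_dim)

print(one_hot.shape, embedding.shape)  # (50000,) (300,)
```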

11.2 Get in touch with Word embeddings

Analogy task

Try running analogy_sample.py


```
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python analogy_sample.py
tokyo - japan + france =  ('paris', 0.9174968004226685)
```

The model loaded by gensim.downloader.load('<word embeddings model>') holds the feature vector corresponding to each word. Note, however, that depending on the model only English words may be covered.

This example works with the words tokyo, japan, and france. Interpreting the vector arithmetic in terms of word meanings, it can be read as (tokyo - japan) + france = (capital) + france ≈ paris.
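For reference, here is a minimal sketch of how such an analogy query can be issued with gensim. This is not the book's analogy_sample.py, and the model name "glove-wiki-gigaword-50" is my own assumption; the download takes a while on first use.

```python
import gensim.downloader

# Load a pretrained word-embeddings model (the model name is an assumption,
# not necessarily the one used in analogy_sample.py).
model = gensim.downloader.load("glove-wiki-gigaword-50")

# tokyo - japan + france: most_similar() adds the "positive" vectors,
# subtracts the "negative" ones, and returns the nearest words.
print(model.most_similar(positive=["tokyo", "france"], negative=["japan"], topn=1))
```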

Applied examples

Let's try some examples that appear in other books.

```
king - man + woman =  [('king', 0.8859834671020508), ('queen', 0.8609581589698792), ('daughter', 0.7684512138366699), ('prince', 0.7640699148178101), ('throne', 0.7634970545768738), ('princess', 0.7512727975845337), ('elizabeth', 0.7506488561630249), ('father', 0.7314497232437134), ('kingdom', 0.7296158075332642), ('mother', 0.7280011177062988)]

gone - go + see =  [('see', 0.8548812866210938), ('seen', 0.8507398366928101), ('still', 0.8384071588516235), ('indeed', 0.8378400206565857), ('fact', 0.835073709487915), ('probably', 0.8323071002960205), ('perhaps', 0.8315557837486267), ('even', 0.8241520524024963), ('thought', 0.8223952054977417), ('much', 0.8205327987670898)]
```

Synonyms

Synonyms can be obtained with model.wv.similar_by_vector(...), as used in the analogy task above.
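Concretely, a sketch along the following lines builds the query vector by hand and looks up its nearest neighbours (model is assumed to be the KeyedVectors object loaded above; with a full Word2Vec model the same method is reached as model.wv.similar_by_vector):

```python
# Build the query vector by hand: king - man + woman.
query_vector = model["king"] - model["man"] + model["woman"]

# Nearest neighbours of the query vector. Note that similar_by_vector()
# does not exclude the query words themselves, which is why "king"
# appears at the top of the result shown above.
print(model.similar_by_vector(query_vector, topn=10))

# The same lookup on a single word's vector acts as a simple synonym search.
print(model.similar_by_vector(model["car"], topn=5))
```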

The nature of Word embeddings

The distributed representation obtained by Word embeddings has the following properties.

- Addition and subtraction of vectors can express addition and subtraction of meaning.
- Distributed representations of words with similar meanings lie close together in the vector space.

Types of Word embeddings

| Type | Description |
| --- | --- |
| Word2Vec | Obtains distributed representations by focusing on windows of several consecutive words in a sentence |
| GloVe | Obtains distributed representations by using word co-occurrence frequency information across the entire training data |
| fastText | Obtains distributed representations of character n-grams and sums them to form the distributed representations of words |
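Of these, Word2Vec is the one used later in this step. As an aside, here is a minimal sketch of training a small Word2Vec model on a toy corpus with gensim (my own illustration; the corpus and parameters are arbitrary):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["i", "like", "natural", "language", "processing"],
    ["word", "embeddings", "map", "words", "to", "dense", "vectors"],
]

# "window" controls how many consecutive words around the target word are
# considered. The dimension argument is vector_size in gensim >= 4
# (it was called size in older versions).
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(w2v.wv["word"].shape)  # (50,)
```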

11.3 Use of trained model and Japanese support

As mentioned above, word embeddings can be used through models that have already been trained and published.

When using a trained model distributed by others, pay attention to that model's license and terms of use.
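As a minimal sketch of this workflow (the file path below is hypothetical; use whatever file the model you downloaded is actually distributed as), a trained model saved in the common word2vec format can be loaded with gensim:

```python
from gensim.models import KeyedVectors

# Hypothetical path to a downloaded pretrained model file; many publicly
# distributed models (including Japanese ones) use the word2vec format.
model = KeyedVectors.load_word2vec_format("path/to/pretrained_vectors.bin", binary=True)

print(model.vector_size)            # dimensionality of the distributed representation
print(model.most_similar("パリ"))    # only works if the model covers Japanese vocabulary
```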

11.4 Word embeddings in the identification task

Use distributed representation as a feature

simple_we_classification.py is under sec130_140_cnn_rnn/classification/. Since tokenize.py does not exist in this directory, I used sec40_preprocessing/tokenizer.py.

Additions / changes from the previous chapter (Step 09)

- Feature extraction change: TF-IDF → Word2Vec

```python
def calc_text_feature(text):
    """
    Compute the feature of text based on the distributed representations of its words.
    Tokenize the text, look up the distributed representation of each token,
    and use the sum of all those representations as the feature of the text.
    """
    tokens = tokenize(text)

    # Stack the word vectors of all tokens found in the model's vocabulary.
    word_vectors = np.empty((0, model.wv.vector_size))
    for token in tokens:
        try:
            word_vector = model[token]
            word_vectors = np.vstack((word_vectors, word_vector))
        except KeyError:
            # Tokens that are not in the vocabulary are skipped.
            pass

    # If no token was in the vocabulary, fall back to a zero vector.
    if word_vectors.shape[0] == 0:
        return np.zeros(model.wv.vector_size)
    return np.sum(word_vectors, axis=0)
```
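For reference, here is a hedged sketch of how such features might be fed to a classifier, assuming scikit-learn's SVC and that train_texts/train_labels and test_texts/test_labels are prepared as in the script (the actual simple_we_classification.py may differ in its classifier and data handling):

```python
import numpy as np
from sklearn.svm import SVC

# Stack one feature vector per training text (train_texts/train_labels are
# assumed to be loaded elsewhere, as in the book's script).
train_features = np.vstack([calc_text_feature(text) for text in train_texts])

classifier = SVC()
classifier.fit(train_features, train_labels)

# Evaluate on held-out texts in the same way.
test_features = np.vstack([calc_text_feature(text) for text in test_texts])
print(classifier.score(test_features, test_labels))
```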

Execution result


```
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python simple_we_classification.py
0.40425531914893614
```

- Normal implementation (Step 01): 37.2%
- Preprocessing added (Step 02): 43.6%
- Preprocessing + feature extraction change (Step 04): 58.5%
- Preprocessing + feature extraction change (Step 11): 40.4%

Performance is low with sentence-level features obtained by simply summing word embeddings.


- Preprocessing + feature extraction change + classifier change (Step 06): 61.7%
- Preprocessing + feature extraction change + classifier change (Step 09): 66.0%

Unification of morphological analyzer and preprocessing

When using a word embeddings model, it is desirable to **reproduce the same word-separation method and preprocessing that were used when the model was trained**.
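For example, if the model was trained on text that was Unicode-normalized and tokenized with a particular tokenizer, the same steps should be applied before looking up vectors. The sketch below only illustrates the idea; the concrete normalization and tokenize() function must match whatever the model's documentation specifies.

```python
import unicodedata

def preprocess(text):
    # Apply the same normalization assumed to have been used when the
    # embeddings were trained, e.g. Unicode NFKC normalization.
    return unicodedata.normalize("NFKC", text)

def tokens_for_embedding_lookup(text, tokenize):
    # Reuse the very same tokenize() (word-separation method) that the
    # embeddings model was trained with.
    return tokenize(preprocess(text))
```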
