Summarizing Doc2Vec

Introduction

This time, I studied **Doc2Vec**, an extension of Word2Vec. Document classification and document clustering are tasks that come up often in natural language processing, but both require a distributed representation of the document itself. With Doc2Vec, you can obtain that distributed representation directly.

References

While studying Doc2Vec, I referred to the following resources.

- Doc2vec (Paragraph Vector) algorithm
- Distributed Representations of Sentences and Documents
- Document similarity calculation tutorial using the Doc2Vec mechanism and gensim
- How to use natural language processing technology: I tried to predict the quality of papers using Doc2Vec and DAN!

Doc2Vec

What is Doc2Vec

Doc2Vec is a technique that converts sentences of arbitrary length into fixed-length vectors. Whereas Word2Vec obtains distributed representations of words, Doc2Vec obtains distributed representations of sentences and documents. Bag-of-Words and TF-IDF are the classic methods for obtaining vector representations of sentences, but they have the following weaknesses.

- They do not retain the order of the words in a sentence.
- They treat synonyms as completely independent, unrelated words.

These are so-called count-based methods; Doc2Vec takes a different approach to acquiring distributed representations of sentences, one that overcomes the weaknesses above. The word-order weakness is easy to see in code, as the sketch below shows.
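Here is a tiny illustration of that first weakness (it uses scikit-learn's CountVectorizer purely for demonstration; this tooling is my assumption, not something the article relies on):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings
docs = ['dog bites man', 'man bites dog']
X = CountVectorizer().fit_transform(docs).toarray()

# Both rows are identical: Bag-of-Words discards word order entirely
print((X[0] == X[1]).all())  # True
```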

Doc2Vec algorithm

Doc2Vec is a general term for the following two algorithms.

- PV-DM (Distributed Memory Model of Paragraph Vectors)
- PV-DBOW (Distributed Bag of Words version of Paragraph Vector)

Each algorithm is briefly described below.

PV-DM

PV-DM is the algorithm that corresponds to Word2Vec's CBOW. The model is trained to predict the next word from a sentence id and several words, and the distributed representation of the sentence is obtained in the course of solving this task. As far as I can tell, training proceeds by the following procedure (a minimal gensim sketch follows the figure below).

  1. Prepare a sentence vector and word vectors sampled from the document.
  2. Combine the vectors from step 1 in the hidden layer (mean or concatenation; selectable in gensim).
  3. Predict the word that follows the sampled words.
  4. Update the sentence vector and the hidden-to-output layer weights.

The model is illustrated below (figure quoted from the original paper listed in References).

(Figure: PV-DM model architecture)
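To make PV-DM concrete, here is a minimal gensim sketch on a toy corpus (the corpus and all parameter values are illustrative assumptions, not the settings used later in this article):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is a token list tagged with a sentence id
docs = [TaggedDocument(words=['the', 'cat', 'sat', 'down'], tags=[0]),
        TaggedDocument(words=['the', 'dog', 'ran', 'away'], tags=[1])]

# dm=1 selects PV-DM; dm_mean=1 averages the sentence and word vectors
# in the hidden layer (dm_concat=1 would concatenate them instead)
model = Doc2Vec(docs, dm=1, dm_mean=1, vector_size=50,
                window=2, min_count=1, epochs=20)
```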

PV-DBOW

PV-DBOW is the algorithm that corresponds to Word2Vec's skip-gram. PV-DBOW can be trained faster than PV-DM because it does not need to use word vectors during training. However, because PV-DBOW ignores word order during training, PV-DM is said to be more accurate.

As far as I can tell, training proceeds by the following procedure.

  1. Sample an arbitrary number of words from the same sentence.
  2. Optimize the sentence vector and the hidden-to-output layer weights so as to predict the sampled words.

Again, the model is illustrated below (figure quoted from the original paper).

(Figure: PV-DBOW model architecture)
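For comparison, the same toy corpus trained with PV-DBOW (again a sketch under the same assumptions; setting dbow_words=1 would additionally train skip-gram word vectors, at extra cost):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=['the', 'cat', 'sat', 'down'], tags=[0]),
        TaggedDocument(words=['the', 'dog', 'ran', 'away'], tags=[1])]

# dm=0 selects PV-DBOW: only the sentence vector is used to predict
# the words sampled from each sentence
model = Doc2Vec(docs, dm=0, vector_size=50, window=2,
                min_count=1, epochs=20)
```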


I have briefly summarized the two algorithms above, but there were some details I could not fully understand. If anyone knows, please let me know in the comments.

**Things I could not figure out even after looking into it**

- In what form is the document vector first given as input? Is it correct to understand that only the id of the sentence is passed?
- Where does the paragraph vector obtained as the final result of training come from? (In Word2Vec, the weight matrix that maps the input to the hidden layer becomes the distributed representations of the words.)

Creating a Doc2Vec model using the library

In the following, we actually create a Doc2Vec model using the library.

Library used

gensim 3.8.1

Dataset

You can easily create a Doc2Vec model using gensim, a Python library. This time, we use the "livedoor news corpus" as the dataset. For details of the dataset and the morphological analysis procedure, please refer to the previously posted article.

In the case of Japanese, preprocessing that splits sentences into morphemes is required in advance, so after splitting all sentences into morphemes, we load them into the following DataFrame.

(Figure: DataFrame of the corpus after morphological analysis)

The rightmost column contains each sentence after morphological analysis, with the morphemes separated by half-width spaces. We use this column to create the Doc2Vec model.

Model training

We create the Doc2Vec model using gensim. The main parameters for building a model are listed below.

| Parameter | Meaning |
| --- | --- |
| dm | If 1, train with PV-DM; if 0, train with PV-DBOW |
| vector_size | Dimensionality of the distributed representation the text is converted into |
| window | How many context words are used to predict the next word (PV-DM), or how many words are predicted from the document id (PV-DBOW) |
| min_count | Ignore words that appear fewer than the specified number of times |
| workers | Number of threads used for training |

Below is the code that builds the Doc2Vec model. Once the input texts have been prepared, the model itself can be created in a single line.


from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Split each space-separated text into a list of morphemes
sentences = []
for text in df[3]:
    text_list = text.split(' ')
    sentences.append(text_list)

# Tag each document with its index so it can be referenced by id later
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
model = Doc2Vec(documents, vector_size=2, window=5, min_count=1, workers=4)
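As a side note, a trained model can also infer a vector for an unseen document with infer_vector (the token list below is a hypothetical example, assuming the same space-separated morpheme format):

```python
# Infer a vector for a new, unseen document (tokens are illustrative)
new_doc = ['サッカー', '日本', '代表', 'が', '勝利']
vec = model.infer_vector(new_doc)
```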

What you can do with Doc2Vec

With the Doc2Vec model, we have obtained distributed representations of the sentences. These representations let us express the semantic distance between sentences quantitatively.
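For example, the semantic closeness of two learned documents can be read off directly as a cosine similarity (document ids 0 and 1 here are arbitrary):

```python
# Cosine similarity between the learned vectors of documents 0 and 1
print(model.docvecs.similarity(0, 1))
```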

Using the model created above, let's find out what kinds of articles are similar to the following news article.

'In the South Africa World Cup match between Japan and the Netherlands on the 19th (Sat, Japan time), the Japanese national team fought well,
but conceded the opening goal to Sneijder in the second half and lost 0-1.
Hidetoshi Nakata, the former Japan national team playmaker who provided commentary for TV Asahi after the match, said,
"Well, even though we lost 0-1, the team fought much better than in the first match, and especially in the second half Japan was attacking well too. I think it was a match that leads to the next one."
He added, "(There are times when we concede sloppy goals, but) we defend properly and then string our attacks together.
Especially toward the end, Manager Okada brought on attacking players early, and that attitude will carry over to the next game.
It's not that losing this game is acceptable, but it was great to see the attitude of going for the win."'

Documents similar to a specified document can be output as follows.


# Pass the id of a document to output the documents closest to it
# (here, 5792 is the id of the article above)
model.docvecs.most_similar(5792)

The output is below: a list of document ids paired with their cosine similarity to the query document.

[(6084, 0.8220762014389038),
 (5838, 0.8150338530540466),
 (6910, 0.8055128455162048),
 (351, 0.8003012537956238),
 (6223, 0.7960485816001892),
 (5826, 0.7933120131492615),
 (6246, 0.7902486324310303),
 (6332, 0.7871333360671997),
 (6447, 0.7836691737174988),
 (6067, 0.7836177349090576)]
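To sanity-check the results, the returned ids can be mapped back to the original articles (a sketch assuming the df from the preprocessing step is still in scope, with a default integer index):

```python
# Print the similarity and the first 50 characters of each similar article
for doc_id, sim in model.docvecs.most_similar(5792):
    print(doc_id, round(sim, 3), df[3][doc_id][:50])
```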

Let's take a look at the contents of articles with high similarity.

## Article content of 6084

'The soccer U-22 Japan national team is aiming to qualify for the London Olympics.
The second qualifying round is a home-and-away series against Kuwait.
The player drawing attention is the U-22 side's Japanese ace Kensuke Nagai (Nagoya Grampus), who boasts 8 goals in 11 games for the team.
TBS's "S-1", broadcast at midnight on the 18th, took a close look at the fast-running FW, who is said to cover 50m in 5.8 seconds.
Nagai said, "I won't lose a footrace from a standing start," and "I don't know myself, but people say I'm fast."
"It was around my second year of high school (that I got faster)," he said. As for "Nagai's speed legend," a friend who knew him in high school said, "He could catch up with a car going about 40 kilometers per hour."
"Make that guy run," said Koichi Sugiyama, his coach at Kyushu International University High School. "When the opposing player was offside on a through pass he played, he would chase the ball down himself and turn it into a dribble. That happened a lot."
When it comes to fast soccer players, the "wilderness" Masayuki Okano is famous, but Nagai, asked about Okano, laughed wryly, "I'm not that fast."
Nagai, who was small and not fast when he entered high school, said, "I often fell down and came back." Sugiyama recalls, "The timing was right; his physical growth and training came together well," and says Nagai's ability blossomed through hellish training on the school's specialty: slopes and stairs.'

## Article content of 5838

'In the TBS sports program "S1" broadcast at midnight on the 29th, guest commentator Ruy Ramos expressed his anger at the fact that the Japanese national team's coach has still not been decided.
Japan and North Korea are the only countries going to the World Cup that have not decided on a national team coach. "No, it's too pathetic. Above all, the motivation of the selected players will drop.
Looking at the squad this time, Inamoto, Tamada, why aren't they in it?" Ramos began, with an openly sullen expression.
When asked, "Mr. Ramos, should you take the job?", he said, "No, I'd like to do it. That's definitely true," but also, "I can't be chosen. It would have had to be while I was active. It's too late now. I'm really disappointed. It's lonely."'

You can see that the top two are both articles about soccer. Moreover, the training corpus was composed of articles from nine news sites, and eight of the ten most similar articles were from sports news sites.

If sentences can be converted into fixed-length vector representations in this way, they can be fed into various machine learning algorithms such as clustering and classification. In natural language processing, how to obtain highly expressive vector representations is a central question, and many algorithms, including the Doc2Vec introduced here, have been developed for it.
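As one example, here is a minimal clustering sketch over the learned document vectors (it assumes scikit-learn is available; the choice of 9 clusters simply mirrors the nine news sites in the corpus):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack all learned document vectors into one matrix
X = np.array([model.docvecs[i] for i in range(len(documents))])

# Cluster into 9 groups, matching the nine news sites in the corpus
labels = KMeans(n_clusters=9, random_state=0).fit_predict(X)
```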

Next time, I will try various natural language processing tasks using this Doc2Vec model.
