[PYTHON] [gensim] How to use Doc2Vec

Most of this is covered in the official gensim documentation, but there aren't many Japanese resources, so I'll summarize the basics I use most often, aimed at beginners.

Preparation (installation)

pip install gensim

Preparing the training data

Different sites write this differently, but personally I find the following style the most comfortable:

# coding: UTF-8
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Text file with words separated by spaces and one document per line
f = open('Training data.txt', 'r')

# Split each document into words and collect them in a list;
# roughly [([word1, word2, word3], [doc_id]), ...]
# words: the list of words in the document (duplicates kept)
# tags: document identifiers (given as a list; a document can carry several tags)
trainings = [TaggedDocument(words=data.split(), tags=[i]) for i, data in enumerate(f)]
f.close()

Incidentally, what I trained on this time is 12 million reviews from the book-review site Dokusho Meter (読書メーター), which I collected by scraping. Since the data exceeds 1 GB, it can be quite hard to fit in memory depending on your PC.

Training the model

# Train the model (parameters are discussed later)
m = Doc2Vec(documents=trainings, dm=1, size=300, window=8, min_count=10, workers=4)
# (In gensim 4.x, `size` was renamed `vector_size`.)

# Save the model
m.save('model/doc2vec2.model')

# Load a model (if you already have one, you can start from here)
m = Doc2Vec.load('model/doc2vec2.model')

Note that it may take a long time depending on the size of the training data.

Frequently used functions

Find the training documents most similar to the document with a given id:

# The argument is the document id
print(m.docvecs.most_similar(0))

# Returns the top 10 (document id, similarity) pairs most similar to document 0
>> [(55893, 0.6868613362312317), (85550, 0.6866280436515808), (80831, 0.6864551305770874), (61463, 0.6863148212432861), (72602, 0.6847503185272217), (56876, 0.6835699081420898), (80847, 0.6832736134529114), (92838, 0.6829516291618347), (24495, 0.6820268630981445), (45589, 0.679581880569458)]

Examine the similarity between two arbitrary training documents:

# Similarity between document 1 and document 307
print(m.docvecs.similarity(1, 307))
>> 0.279532733106

Use the trained model to compute the similarity between newly given (unseen) documents:

# For example, compute the similarity of several pairs among the following four new documents.
doc_words1 = ["last", "Deployment" ,"early" ,"other" ,"the work", "impact", "receive" ,"Behind the back" ,"Tsukuri", "trick" ,"Every time" ,"thing", "Take off your hat", "To do", "Read", "Cheap" ,"Me" ,"Mystery"]
doc_words2 = [ "Initiation love", "Similarly" ,"last", "A few lines", "Plot twist", "Go", "Time", "Time", "various", "scene", "To do" ,"To be", "Foreshadowing" ,"Sprinkle", "らTo be" ,"Is", "thing", "notice"]
doc_words3 = ["last", "Deployment" ,"early" ,"other" ,"the work", "impact", "receive" ,"Behind the back" ,"Tsukuri","Mystery"]
doc_words4 = ["Unique", "View of the world", "Everyday" ,"Leave","Calm down","Time","Read","Book"]

# Note: in gensim 4.x this method moved onto the model itself:
# m.similarity_unseen_docs(doc_words1, doc_words2, ...)
print("1-2 sim")
sim_value = m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words2, alpha=1, min_alpha=0.0001, steps=5)
print(sim_value)

print("1-3 sim")
print(m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words3, alpha=1, min_alpha=0.0001, steps=5))

print("1-4 sim")
print(m.docvecs.similarity_unseen_docs(m, doc_words1, doc_words4, alpha=1, min_alpha=0.0001, steps=5))

print("2-3 sim")
print(m.docvecs.similarity_unseen_docs(m, doc_words2, doc_words3, alpha=1, min_alpha=0.0001, steps=5))

>> 1-2 sim
   1-3 sim
   1-4 sim
   2-3 sim

Even to the human eye, documents 1 and 3 and documents 2 and 3 are clearly similar, while documents 1 and 4 are not, so the computed similarities match intuition quite well.

Output the compressed vector of a new document (a vector whose dimensionality is the `size` specified at training time):

newvec = m.infer_vector(doc_words1)
print(newvec)

>> [  1.19107231e-01  -4.06390838e-02  -2.55129002e-02   1.16982162e-01
  -1.47758834e-02   1.07912444e-01  -4.76960577e-02  -9.73785818e-02
  -1.61364377e-02  -9.76370368e-03   4.98018935e-02  -8.88026431e-02
   1.34409174e-01  -1.01136886e-01  -4.24979888e-02   7.16169327e-02]

What I want to add in the future

- Tuning the parameters when training the model
- What this can be applied to

Also, regarding the Doc2Vec algorithm itself, I found an explanation on the blog of the Kitayama Lab at Kogakuin University: [Algorithm of doc2vec (Paragraph Vector)](https://kitayamalab.wordpress.com/2016/12/10/algorithm of doc2vecparagraph-vector-/)
