[PYTHON] Building a recommendation system using doc2vec on reviews

I'm bokeneko, craft beer manager and engineer at Retty. I built an experimental recommendation system using doc2vec, so here is how it works.

doc2vec

doc2vec is an extension of word2vec. word2vec tries to capture the meaning of a word from the words that tend to appear around it; doc2vec additionally learns the context the words appear in.

For example, the idea behind word2vec is that "dog" and "cat" have similar meanings because both can fill xxx in the sentence "I have xxx". However, if the sentence comes from a novel about dogs, "dog" is far more likely to appear than "cat", and if it is a passage from an SM novel ... well, the likely words change. I hope you get the idea. In other words, how likely a word is to appear depends on the context of the text, so doc2vec learns what kind of text a document is from the words used in it.

For details, please read Distributed Representations of Sentences and Documents. To be honest, I only skimmed it, so I may have misinterpreted something.

doc2vec on reviews

You can learn distributed representations of reviews by feeding the reviews to doc2vec. gensim includes doc2vec, so let's use it. Theory aside, I'm really grateful that it is so easy to try.

# -*- coding:utf-8 -*-

from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

import MeCab
import csv

mt = MeCab.Tagger()

reports = []
with open("reports.tsv") as f:
    # Each line of reports.tsv holds a review ID and the review text, separated by a tab
    reader = csv.reader(f, delimiter="\t")
    for report_id, report in reader:
        words = []
        node = mt.parseToNode(report)
        while node:
            if len(node.surface) > 0:
                words.append(node.surface)
            node = node.next
        # words is the list of tokens in the review; the review ID goes in tags
        reports.append(TaggedDocument(words=words, tags=[report_id]))

# vector_size was named size in older gensim releases
model = Doc2Vec(documents=reports, vector_size=128, window=8, min_count=5, workers=8)
model.save("doc2vec.model")

Each review is now learned as a 128-dimensional vector. You can look up a learned review vector as follows.

# -*- coding:utf-8 -*-

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")
sample_report_id = ...  # ID of the review whose distributed representation you want to inspect

# In gensim 4.x and later this attribute is named model.dv
report_vector = model.docvecs[sample_report_id]

Recommendation

Now that we have review vectors, here is how to turn them into recommendations. Treat the arithmetic mean of all of a user's review vectors as the vector representing that user. In the same way, treat the arithmetic mean of all review vectors for a store as the vector representing that store. Using the user and store vectors built this way, I thought that closeness between vectors could serve as a recommendation: stores close to a user, stores close to a store, users close to a user, and users close to a store.

It's just a hypothesis, and the only way to find out whether it works is to try it, so let's get right to it.
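
The averaging step described above can be sketched as follows. This is a minimal illustration under assumptions, not the code behind the results in this post: `mean_vectors` and the toy `review_vectors` / `review_user` dictionaries are hypothetical names, and in practice the vectors would be read from the trained doc2vec model.

```python
from collections import defaultdict


def mean_vectors(review_vectors, review_owner):
    """Average review vectors per owner (a user ID or a store ID).

    review_vectors: dict mapping review ID -> list of floats
    review_owner:   dict mapping review ID -> owner ID (hypothetical mapping)
    """
    grouped = defaultdict(list)
    for review_id, vector in review_vectors.items():
        grouped[review_owner[review_id]].append(vector)

    averaged = {}
    for owner, vectors in grouped.items():
        count = len(vectors)
        # zip(*vectors) walks the vectors dimension by dimension
        averaged[owner] = [sum(dim) / count for dim in zip(*vectors)]
    return averaged


# Toy 2-dimensional vectors standing in for the 128-dimensional ones
review_vectors = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [0.5, 0.5]}
review_user = {"r1": "alice", "r2": "alice", "r3": "bob"}  # hypothetical IDs

user_vectors = mean_vectors(review_vectors, review_user)
```

Passing a review-to-store mapping instead of `review_user` would give store vectors with the same function.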

ngt

I used ngt to search for the nearest neighbors of a vector quickly.

To use it, prepare one file for users and one for stores, with one vector per line and the value of each dimension separated by tabs, like this.

users.tsv


-0.32609        0.0670668       -0.0722714      -0.0738026      0.0177741 ....
...

restaurants.tsv


0.0385331       0.0978981       -0.0495091      -0.182571       0.0538142 ...
...

These files are produced by writing out the doc2vec results above.
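
Writing the vectors out in this format might look like the sketch below. The helper name `write_vectors_tsv`, the toy vectors, and the six-significant-digit formatting are assumptions for illustration. Because the file format shown above has no ID column, the sketch also keeps the row order in `user_ids` so that a row can be mapped back to a user later.

```python
import csv
import io


def write_vectors_tsv(vectors, out):
    """Write one vector per line, dimensions separated by tabs."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for vector in vectors:
        writer.writerow(f"{value:.6g}" for value in vector)


# Toy vectors; in practice these would be the averaged user or store vectors
user_vectors = {"alice": [-0.32609, 0.0670668], "bob": [0.0385331, 0.0978981]}

# Remember the row order so a row number can be mapped back to a user ID
user_ids = sorted(user_vectors)

# An in-memory buffer is used here; open("users.tsv", "w", newline="") would write a file
buffer = io.StringIO()
write_vectors_tsv((user_vectors[uid] for uid in user_ids), buffer)
tsv_text = buffer.getvalue()
```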

Then create the DBs with the following commands.

$ ngt create -d 128 users users.tsv
$ ngt create -d 128 restaurants restaurants.tsv

You should then have the following directories.

<cur dir>
|--restaurants
|   |-- grp
|   |-- obj
|   |-- prf
|   |-- tre
|
|--users
    |-- grp
    |-- obj
    |-- prf
    |-- tre

Now you are ready. Run a search as follows.

#When searching for users
$ ngt search -n 10 users search_query.tsv

#When searching for a store
$ ngt search -n 10 restaurants search_query.tsv

# search_query.tsv holds the target vector on one line, with each dimension separated by tabs.
# To find stores close to me, write my user vector in search_query.tsv.
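
On a small dataset, the neighbor search can be cross-checked without ngt. The sketch below is a brute-force search in plain Python, not ngt itself. It assumes Euclidean distance, and `nearest` and the toy store vectors are hypothetical.

```python
import math


def nearest(query, candidates, n=10):
    """Return the n (id, distance) pairs closest to query by Euclidean distance."""
    scored = [(cid, math.dist(query, vector)) for cid, vector in candidates.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n]


# Toy 2-dimensional store vectors (hypothetical data)
store_vectors = {
    "store_a": [0.0, 0.0],
    "store_b": [3.0, 4.0],
    "store_c": [1.0, 0.0],
}
my_vector = [0.0, 0.0]

results = nearest(my_vector, store_vectors, n=2)
```

This scans every candidate, so it only suits small checks; an index such as ngt is what makes the search fast at scale.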

Stores close to users

Let's start by using myself as the guinea pig and listing the recommended stores among those I haven't written a review for.

Ranking Store
1 Shamrock by Abbott Choice
2 Craft Heads
3 Abbot Choice Shibuya
4 Swan Lake Pub Ed
5 Craft Beer Market Jimbocho
6 Cooper Ales
7 Bashamichi Tap Room
8 8taps
9 Bungalow
10 The Shannon's

As you would expect from someone who calls themselves Retty's craft beer manager, it's nothing but beer places. They are certainly my kind of place, so I think it's working.

Craft Heads in particular is a place I haven't reviewed but have been going to for a long time. Which means I really should write that review. In fact, Bashamichi Tap Room is the only one on the list I have never been to. Sorry for not writing the reviews :P
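
The list above leaves out stores I have already reviewed. A minimal sketch of that filtering step, assuming the search has already returned store IDs ordered nearest first (`recommend` and the IDs are hypothetical):

```python
def recommend(ranked_store_ids, reviewed_store_ids, n=10):
    """Keep the ranking order but drop stores the user has already reviewed."""
    reviewed = set(reviewed_store_ids)
    return [sid for sid in ranked_store_ids if sid not in reviewed][:n]


ranked = ["s1", "s2", "s3", "s4", "s5"]  # nearest first (hypothetical IDs)
already_reviewed = ["s2", "s4"]

top = recommend(ranked, already_reviewed, n=2)
```

Since filtering removes entries, the search has to return more than n neighbors for the final list to reach n items.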

Stores close to a store

Craft Heads, mentioned above, is a favorite of mine, so let's look for stores close to it.

Ranking Store
1 The Griffon
2 Good Beer Fausets
3 Beer Pub Camden
4 Vivo!Beer and Dining Bar
5 Watering hole
6 Brewdog Roppongi
7 Nakameguro Tap Room
8 TAP STAND
9 Meguro Republic
10 Burgon Dise Heimel

Yep, nothing but beer places. By the way, I've been to all of them. Beer places seem to be recommended properly.

Users close to users

I can't show users other than Retty employees, so I'll skip this one. For what it's worth, when I listed the users close to me, they were indeed beer enthusiasts.

Users close to the store

This also involves users, so I will skip it.

Summary

I tried this based on the hypothesis that averaging the doc2vec vectors of reviews yields a vector representing a user or a store, and it seems to work quite well. I used MeCab for tokenization this time, but sentencepiece looks quite interesting, so I'm thinking of trying it next.

P.S.

Above, I wrote that the user and store vectors are the arithmetic mean of the review vectors, but in fact I have refined this a little.

At first it was a plain average, but when I tried it in-house, I was told that old tastes were reflected too strongly. Then, when I gave more weight to recent reviews, people who had run out of places matching their tastes to review, and whose recent posts were mostly about other kinds of places, told me that no stores matching their tastes came up. There were various issues like this, so I have been making repeated small adjustments. The results above are from after those adjustments.
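
The post does not say how recent reviews were weighted. One possible scheme, shown purely as an assumption, is an exponentially decaying weight with a tunable half-life. `weighted_mean`, `half_life_days`, and the toy data below are hypothetical.

```python
def weighted_mean(vectors, ages_in_days, half_life_days=365.0):
    """Average vectors, halving a review's weight every half_life_days.

    The exponential decay and the default half-life are assumptions made
    for illustration; the actual weighting is not described in the post.
    """
    weights = [0.5 ** (age / half_life_days) for age in ages_in_days]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / total
        for d in range(len(vectors[0]))
    ]


# Two toy reviews: one written today, one written one half-life ago
vectors = [[1.0, 0.0], [0.0, 1.0]]
ages = [0.0, 365.0]

user_vector = weighted_mean(vectors, ages)
```

A very long half-life brings this back toward the plain average, while a short one lets recent posts dominate, which is the trade-off described above.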
