[PYTHON] [Job Change Conference] Classifying companies by processing reviews in natural language with word2vec

Introduction

I am @naotaka1128, in charge of day 20 of LivesenseAdventCalendar 2016. I currently work as a data analyst for a job change review service called Job Change Conference.

Job Change Conference is Japan's largest job change review service, with millions of company reviews collected. Currently we only display the reviews and scores, but in the future we would like to analyze the reviews with natural language processing and provide useful information that has not been available until now.

This time, as a starting point, I will analyze the reviews using natural language processing technologies such as word2vec and doc2vec, and classify companies.

Natural language processing technology to use

word2vec

Recently, a natural language processing technology called word2vec has become a hot topic. As many of you may know, it uses a large amount of text to represent words as vectors, which enables calculations between words such as the following.

- King - Man + Woman = Queen
- Paris - France + Japan = Tokyo
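The arithmetic above can be illustrated with a minimal sketch. The two-dimensional vectors below are made up by hand for illustration (real word2vec vectors are learned from text and have many more dimensions), and `cosine` and `analogy` are hypothetical helper names, not gensim functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.5, 0.1],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# king - man + woman
result = analogy('king', 'man', 'woman', toy_vectors)
```

With these toy values the nearest remaining word is 'queen', because subtracting "man" and adding "woman" only flips the gender axis.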

doc2vec

There is also doc2vec, which extends word2vec so that the calculations above can be performed on whole documents. Roughly speaking, it is like adding up the word vectors obtained by word2vec.

With doc2vec, documents are converted into vectors, which makes it possible to calculate the similarity between different documents. I will treat each company's reviews as one document and use doc2vec to analyze the relationships between companies.
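As a rough sketch of the "roughly speaking" description only: the code below builds a document vector by averaging word vectors and compares documents by cosine similarity. This is a simplification, since actual doc2vec learns a vector for each document during training rather than averaging. The word vectors, company names, and helper names are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_vector(words, word_vectors):
    """Average the vectors of the known words in one document."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError('no known words in document')
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]

# Hand-made toy word vectors (assumption, not learned from data)
toy_word_vectors = {
    'overtime':  [1.0, 0.0],
    'midnight':  [0.9, 0.1],
    'rewarding': [0.0, 1.0],
    'growth':    [0.1, 0.9],
}

# One "document" per company: all of its review words
toy_reviews = {
    'Company A': ['overtime', 'midnight', 'overtime'],
    'Company B': ['midnight', 'overtime'],
    'Company C': ['rewarding', 'growth'],
}

doc_vectors = {name: average_vector(words, toy_word_vectors)
               for name, words in toy_reviews.items()}

sim_ab = cosine(doc_vectors['Company A'], doc_vectors['Company B'])  # similar reviews
sim_ac = cosine(doc_vectors['Company A'], doc_vectors['Company C'])  # different reviews
```

Companies A and B, whose reviews use the same kind of words, end up with a much higher similarity than A and C.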

Technology to use

- Python
- gensim
  - word2vec / doc2vec
  - In addition, a natural language processing technique called the topic model is also implemented.
  - Previously, I wrote "Try to classify ramen shops by natural language processing using a topic model". Please take a look if you are interested.
- MeCab

Flow

We will proceed as follows.

1. Morphological analysis of company reviews
2. Build a doc2vec model with gensim
   - The word2vec model is built at the same time as the doc2vec model.
3. With the built model:
   - 3-1. word2vec on words that appear in the reviews
   - 3-2. Calculation of similarity between companies with doc2vec
   - 3-3. Addition and subtraction between companies with doc2vec

1. Morphological analysis of company reviews

This step is almost the same as in "Classify ramen shops by natural language processing".

# Read the review data
from io_modules import load_data  # In-house DB loading library
rows = load_data(LOAD_QUERY, KUCHIKOMI_DB)  # [company name, review text]

# Extract stems with the stems function from the reference article
from utils import stems  # Implementation from the reference article, almost as-is
companies = [row[0] for row in rows]
docs = [stems(row[1]) for row in rows]

"""
This produces the following data
companies = ['Black Company Co., Ltd.', 'Rewarding Co., Ltd.', ...]
docs = [
  ['Rewarding', 'not enough', 'overtime', 'very much', 'Many', ...
  ['Midnight overtime', 'Natural', 'Spicy', 'I want to die', 'Impossible', ...
   ...
]
"""

- Overall, this is just preprocessing.
- Stem extraction: I referred to this article.
- Note: no special dictionary was prepared (only neologd is used).

2. Build a model of doc2vec with gensim

Now we come to the heart of natural language processing.

Or so I would like to say, but there is not much to do because I just call the library. The computation was also fast and finished without a hitch.

#Library load
from gensim import models

# Register the reviews with gensim
# An extension class from the reference article is used to attach the company name to each review
sentences = LabeledListSentence(docs, companies)

# Training settings for doc2vec
# alpha: learning rate / min_count: ignore words that appear fewer than X times
# size: number of vector dimensions / iter: number of iterations / workers: number of parallel workers
# (gensim 4.x renamed size -> vector_size and iter -> epochs)
model = models.Doc2Vec(alpha=0.025, min_count=5,
                       size=100, iter=20, workers=4)

#Preparation for doc2vec(Word list construction)
model.build_vocab(sentences)

#You can also use it by forcibly applying the word vector learned from Wikipedia.
# model.intersect_word2vec_format('./data/wiki/wiki2vec.bin', binary=True)

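# Run training
# (newer gensim releases require: model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs))
model.train(sentences)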

#save
model.save('./data/doc2vec.model')

#After training, the model can be loaded from a file
# model = models.Doc2Vec.load('./data/doc2vec.model')

# The order may change, so re-fetch the company list after training.
companies = model.docvecs.offset2doctag
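To make the `min_count` setting above concrete, here is a minimal sketch of what "ignore words that appear fewer than X times" means. `filter_rare_words` is a hypothetical helper written for this illustration, not a gensim function; gensim applies the threshold internally when the vocabulary is built.

```python
from collections import Counter

def filter_rare_words(docs, min_count):
    """Drop words whose total frequency over all documents is below min_count."""
    counts = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]

# Toy documents (assumption): already tokenized review words
toy_docs = [
    ['overtime', 'many', 'overtime'],
    ['overtime', 'rewarding'],
    ['many', 'growth'],
]

# 'overtime' appears 3 times and 'many' twice, so only those two survive
filtered = filter_rare_words(toy_docs, min_count=2)
```

Rare words have too few contexts to learn a reliable vector from, which is why a threshold like this is commonly applied.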

The implementation of LabeledListSentence is as follows:

#Reference article: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
class LabeledListSentence(object):
    def __init__(self, words_list, labels):
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield models.doc2vec.LabeledSentence(words, ['%s' % self.labels[i]])
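One reason to wrap the data in a class with `__iter__`, rather than passing a plain generator, is that gensim walks over the corpus more than once (once to build the vocabulary, then again for each training pass). The sketch below shows the difference without gensim: `ToyTaggedDocument` is a stand-in for gensim's `LabeledSentence` (called `TaggedDocument` in newer releases), and all names and data are made up for illustration.

```python
from collections import namedtuple

# Stand-in for gensim's labeled-document type: a word list plus a list of tags
ToyTaggedDocument = namedtuple('ToyTaggedDocument', ['words', 'tags'])

class ToyLabeledCorpus(object):
    """Re-iterable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, words_list, labels):
        if len(words_list) != len(labels):
            raise ValueError('words_list and labels must have the same length')
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for words, label in zip(self.words_list, self.labels):
            yield ToyTaggedDocument(words, [str(label)])

toy_docs = [['overtime', 'many'], ['rewarding', 'growth']]
toy_companies = ['Company A', 'Company B']

corpus = ToyLabeledCorpus(toy_docs, toy_companies)
first_pass = [doc.tags[0] for doc in corpus]
second_pass = [doc.tags[0] for doc in corpus]  # a second pass works

# A plain generator is exhausted after one pass
one_shot = (ToyTaggedDocument(w, [c]) for w, c in zip(toy_docs, toy_companies))
generator_counts = (len(list(one_shot)), len(list(one_shot)))
```

The class yields both companies on every pass, while the generator yields nothing the second time.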

3-1. Word2vec in the word that appeared in the word of mouth

Well, the model was built in no time.

If the accuracy of word2vec is bad, the results of doc2vec will inevitably be bad too, so we will check the accuracy while playing around with word2vec.

Similar words

First, let's start with "overtime", a very popular topic in job change reviews.

# model.most_similar(positive=[word]) returns the words most similar to word
>> model.most_similar(positive=['overtime'])
[('overtime work', 0.8757208585739136),
 ('Service overtime', 0.8720364570617676),
 ('Wage theft', 0.7500427961349487),
 ('Overtime fee', 0.6272672414779663),
 ('reverberation', 0.6267948746681213),
 ('Holiday work', 0.5998174548149109),
 ('Long working hours', 0.5923150777816772),
 ('workload', 0.5819833278656006),
 ('Overtime', 0.5778118371963501),
 ('overtime pay', 0.5598958730697632)]

Similar words are lined up ...!

It is impressive that it even picks up abbreviations such as "wage theft" and typos such as "reverberation" (both are an abbreviation and a typo in the original Japanese reviews). The computer has understood the concept of "overtime"! It was a deeply moving moment.

I want users to change jobs in a positive frame of mind, so let's also check a positive word.

>> model.most_similar(positive=['Rewarding'])
[('Kai', 0.9375230073928833),
 ('The real thrill', 0.7799979448318481),
 ('interesting', 0.7788150310516357),
 ('Interesting', 0.7710426449775696),
 ('pleasant', 0.712959885597229),
 ('reason to live', 0.6919904351234436),
 ('Interesting', 0.6607719659805298),
 ('joy', 0.6537446975708008),
 ('pride', 0.6432669162750244),
 ('Boring', 0.6373245120048523)]

There is one worrying word at the end ("Boring"), but overall it looks good. The model seems to understand individual words, so the next step is to add and subtract words.

Addition and subtraction of words

In job change reviews, content about how easy a workplace is for women to work in is popular. Let's try something like "single woman - female + male = ?".

First, check the basic word comprehension.

#Words that resemble women
>> model.most_similar(positive=['Female'])
[('Female employee', 0.8745297789573669),
 ('a working woman', 0.697405219078064),
 ('Single woman', 0.6827554106712341),
 ('Woman', 0.5963315963745117)]

#Words that resemble men
>> model.most_similar(positive=['male'])
[('Management', 0.7058243751525879),
 ('Active', 0.6625881195068359),
 ('Special treatment', 0.6411184668540955),
 ('Preferential treatment', 0.5910355448722839)]

#Words that resemble single women
>> model.most_similar(positive=['Single woman'])
[('Female employee', 0.7283456325531006),
 ('Single mother', 0.6969124674797058),
 ('Unmarried', 0.6945561170578003),
 ('Female', 0.6827554106712341)]

This looks fine, so let's run the addition and subtraction.

# Single woman - Female + male = ?
# model.most_similar(positive=[words to add], negative=[words to subtract])
>> model.most_similar(positive=['Single woman', 'male'], negative=['Female'])
[('Unmarried', 0.665600597858429),
 ('Management', 0.6068357825279236),
 ('Having children', 0.58555006980896),
 ('Boys', 0.530462384223938),
 ('Special treatment', 0.5190619230270386)]

Although some of the results are questionable, feminine words such as "female employee" and "single mother" have disappeared from the neighbors of "single woman", and "unmarried" is now estimated to be the most similar word.

word2vec seems to be working reasonably well, so we will move on to classifying companies.
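The addition and subtraction above can be sketched in plain Python as follows. This shows the general idea only (sum the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity) and is not gensim's actual code. The three-dimensional vectors are made up by hand, and this `most_similar` is a hypothetical stand-alone function.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the input words
    scored = [(w, sum(q * x for q, x in zip(query, unit(vec))))
              for w, vec in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hand-made toy vectors (assumption): axes ~ marital status, person, gender
toy_vectors = {
    'single woman':    [1.0, 1.0, 0.3],
    'female':          [0.0, 1.0, 0.3],
    'male':            [0.0, 1.0, -0.3],
    'unmarried':       [1.0, 1.0, 0.0],
    'management':      [0.1, 1.0, -0.4],
    'female employee': [0.1, 1.0, 0.4],
}

ranking = most_similar(toy_vectors,
                       positive=['single woman', 'male'],
                       negative=['female'])
```

With these toy values 'unmarried' ranks first and 'female employee' ranks last, mirroring the qualitative pattern in the real output above.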

3-2. Calculation of similarity between companies with doc2vec

Next, let's examine the relationships between companies.

First, as an easy case, let's check a company that has many affiliated companies.

# model.docvecs.most_similar(positive=[Base company ID])
# ID 53 :Recruit Holdings
>> model.docvecs.most_similar(positive=[53])
[('Recruit Lifestyle Co., Ltd.', 0.9008421301841736),
 ('Recruit Jobs Co., Ltd.', 0.8883105516433716),
 ('Recruit Career Co., Ltd.', 0.8839867115020752),
 ('Recruit Housing Company Co., Ltd.', 0.8076469898223877),
 ('Recruit Communications Co., Ltd.', 0.7945607900619507),
 ('Career Design Center Co., Ltd.', 0.7822821140289307),
 ('En-Japan Co., Ltd.', 0.782017707824707),
 ('Recruit Marketing Partners Co., Ltd.', 0.7807818651199341),
 ('CyberAgent, Inc.', 0.7434782385826111),
 ('Quick Co., Ltd.', 0.7397039532661438)]

It may seem too easy, but the companies most similar to Recruit Holdings are indeed Recruit affiliates.

This looks fine, so let's take a look at some ordinary companies.

# ID 1338 : DeNA
>> model.docvecs.most_similar(positive=[1338])
[('GREE, Inc.', 0.8263522386550903),
 ('CyberAgent, Inc.', 0.8176108598709106),
 ('Drecom Co., Ltd.', 0.7977319955825806),
 ('Speee, Inc.', 0.787316083908081),
 ('CYBIRD Co., Ltd.', 0.7823044061660767),
 ('Dwango Co., Ltd.', 0.767551064491272),
 ('Yahoo Japan Corporation', 0.7610974907875061),
 ('KLab Co., Ltd.', 0.7593647837638855),
 ('Gloops Co., Ltd.', 0.7475718855857849),
 ('ＮＨＮ\u3000comico Co., Ltd.', 0.7439380288124084)]

DeNA, which has been in the news recently, turned out to be most similar to GREE. It seems to have been judged as game-related, and CyberAgent and Drecom also appear.

This looks pretty good. Looking only at Web companies might bias the results, so let's also look at completely different kinds of companies.

# ID 862 :Honda
>> model.docvecs.most_similar(positive=[862])
[('Toyota Motor Corporation', 0.860333263874054),
 ('Mazda Corporation Inc', 0.843244194984436),
 ('Denso Co., Ltd.', 0.8296780586242676),
 ('Fuji Heavy Industries Ltd.', 0.8261093497276306),
 ('Hino Motors Co., Ltd.', 0.8115691542625427),
 ('Nissan Motor Co., Ltd', 0.8105560541152954),
 ('Daihatsu Motor Co., Ltd.', 0.8088374137878418),
 ('Aisin Seiki Co., Ltd.', 0.8074800372123718),
 ('Honda R & D Co., Ltd.', 0.7952905893325806),
 ('Toyota Industries Corporation', 0.7946352362632751)]

# ID 38 :Sony
>> model.docvecs.most_similar(positive=[38])
[('Panasonic Corporation', 0.8186650276184082),
 ('Toshiba Corporation', 0.7851587533950806),
 ('OMRON Corporation', 0.7402874231338501),
 ('NEC', 0.7391767501831055),
 ('Nikon Corporation', 0.7331269383430481),
 ('Sony Global Manufacturing & Operations Corporation', 0.7183523178100586),
 ('Taiyo Yuden Co., Ltd.', 0.7149790525436401),
 ('Sharp Corporation', 0.7115868330001831),
 ('Pioneer Corporation', 0.7104746103286743),
 ('Canon Inc', 0.7103182077407837)]

# ID 1688 :McKinsey(Consulting firm)
>> model.docvecs.most_similar(positive=[1688])
[('Accenture Co., Ltd.', 0.7885801196098328),
 ('Boston Consulting Group Co., Ltd.', 0.7835338115692139),
 ('Goldman Sachs Securities Co., Ltd.', 0.7507193088531494),
 ('Deloitte Tohmatsu Consulting LLC', 0.7278151512145996),
 ('SIGMAXYZ Co., Ltd.', 0.6909163594245911),
 ('PwC Advisory LLC', 0.6522221565246582),
 ('Link and Motivation Co., Ltd.', 0.6289964914321899),
 ('Morgan Stanley MUFG Securities Co., Ltd.', 0.6283067464828491),
 ('EY Advisory Co., Ltd.', 0.6275663375854492),
 ('ABeam Consulting Co., Ltd.', 0.6181442737579346)]

The results look generally fine.

Since the similarity between companies can be calculated in this way (that is, distances can be computed), analyses such as the following become easy.

- Categorize companies using clustering methods such as K-means
- Visualize the distribution of companies using methods such as multidimensional scaling

The above process is very easy to implement with scikit-learn.
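As a dependency-free illustration of the clustering idea, here is a tiny K-means written in plain Python on made-up company vectors. In practice one would more likely pass the document vectors to scikit-learn's `KMeans`; the function names, the farthest-point initialization, and the toy data below are all assumptions made for this sketch.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]

def kmeans(points, k, iterations=10):
    """Return a cluster index for each point (deterministic initialization)."""
    # Start from the first point, then repeatedly add the point farthest
    # from all centers chosen so far
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(squared_distance(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every point
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = mean(members)
    return labels

# Hand-made toy document vectors (assumption)
names = ['Web company A', 'Web company B', 'Car maker A', 'Car maker B']
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

clusters = kmeans(points, k=2)
grouped = dict(zip(names, clusters))
```

The two Web companies land in one cluster and the two car makers in the other.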

I actually tried visualizing the distribution with multidimensional scaling this time, but writing it up would make this article very long, so I would like to introduce it on another occasion.

3-3. Addition and subtraction between companies with doc2vec

Like word2vec, doc2vec can add and subtract documents. Let's give it a try.

As mentioned earlier, the companies that resemble Recruit Holdings were Recruit companies.

# ID 53:Companies similar to Recruit Holdings(Repost)
>> model.docvecs.most_similar(positive=[53])
[('Recruit Lifestyle Co., Ltd.', 0.9008421301841736),
 ('Recruit Jobs Co., Ltd.', 0.8883105516433716),
 ('Recruit Career Co., Ltd.', 0.8839867115020752),
 ('Recruit Housing Company Co., Ltd.', 0.8076469898223877),
 ('Recruit Communications Co., Ltd.', 0.7945607900619507),
 ('Career Design Center Co., Ltd.', 0.7822821140289307),
 ('En-Japan Co., Ltd.', 0.782017707824707),
 ('Recruit Marketing Partners Co., Ltd.', 0.7807818651199341),
 ('CyberAgent, Inc.', 0.7434782385826111),
 ('Quick Co., Ltd.', 0.7397039532661438)]

Now let's add Intelligence, a major human resources company that operates "Job change information DODA", "Part-time job information an", and other services.

# model.docvecs.most_similar(positive=[base company ID, more IDs to add])
# "ID 53: Recruit Holdings" + "ID 110: Intelligence" = ?
>> model.docvecs.most_similar(positive=[53, 110])
[('Recruit Career Co., Ltd.', 0.888693630695343),
 ('Recruit Jobs Co., Ltd.', 0.865821123123169),
 ('Recruit Lifestyle Co., Ltd.', 0.8580507636070251),
 ('Career Design Center Co., Ltd.', 0.8396339416503906),
 ('En-Japan Co., Ltd.', 0.8285592794418335),
 ('Mynavi Corporation', 0.7874248027801514),
 ('Quick Co., Ltd.', 0.777060866355896),
 ('Recruit Housing Company Co., Ltd.', 0.775804877281189),
 ('CyberAgent, Inc.', 0.7625365257263184),
 ('Neo Career Co., Ltd.', 0.758436381816864)]

The results above can be interpreted as follows.

- Among the Recruit group companies, the two human resources companies have risen to the top.
  - Recruit Career operates "Rikunabi" and "Rikunabi NEXT".
  - Recruit Jobs operates "Townwork" and "Torabayu".
- Human resources companies have moved up, overtaking the Recruit affiliates that are not human resources companies.
  - Career Design Center operates @type and other services.
  - En-Japan operates En-change jobs.

I admit this is a deliberately chosen example that is easy to interpret, but it seems fair to conclude that the approach is working.

Summary and future issues

In this article, I achieved the following using the reviews from Job Change Conference.

- With word2vec, the machine learned the concepts of words appearing in the reviews, and I calculated word similarity and performed addition and subtraction of words.
- With doc2vec, the machine learned the overall character of each company, and I calculated company similarity and performed addition and subtraction of companies.

In the future, I would like to make calculations such as "word + word => similar companies" possible (example: rewarding + growth => Livesense), and to try building a technology that can search for companies and [Jobs](https://career.jobtalk.jp/) with a corporate culture that users like.

However, at present, there is one fatal issue, so I will briefly introduce it at the end. Below is an easy-to-understand example.

#What words are similar to "black"?
>> model.most_similar(positive=['black'])
[('Black company', 0.8150135278701782),
 ('white', 0.7779906392097473),
 ('White company', 0.6732245683670044),
 ('Black company', 0.5990744829177856),
 ('Black black', 0.5734715461730957),
 ('Famous', 0.563334584236145),
 ('clean', 0.5561092495918274),
 ('gray', 0.5449624061584473),
 ('Nightless castle', 0.5446360111236572),
 ('Religious groups', 0.5327660441398621)]

As this example shows, the simple method introduced here recognizes "black" and "white" as similar words.

Words used in the same contexts are treated as the same, and their polarity cannot be distinguished. The machine seems to understand things as follows.

- It understands that the text is talking about overtime.
- It cannot tell whether there is a lot of overtime or only a little.
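The reason opposite words end up close together can be sketched with simple co-occurrence counts. This is a count-based stand-in rather than word2vec itself, but both rely on the words that surround a target word. The sentences and helper names below are made up for illustration.

```python
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Count, for each word, the words appearing within `window` positions of it."""
    vectors = {}
    for words in sentences:
        for i, word in enumerate(words):
            context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[key] * c2[key] for key in c1)  # Counter gives 0 for missing keys
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

# Toy sentences (assumption): "black" and "white" appear in identical contexts
toy_sentences = [
    ['this', 'company', 'is', 'black'],
    ['this', 'company', 'is', 'white'],
    ['overtime', 'pay', 'is', 'missing'],
]

vectors = context_vectors(toy_sentences)
sim_black_white = cosine(vectors['black'], vectors['white'])
sim_black_pay = cosine(vectors['black'], vectors['pay'])
```

Because "black" and "white" are surrounded by exactly the same words here, their context vectors are practically identical, even though their meanings are opposite.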

In fact, I modified and extended the gensim library so that it could output results for "word + word => similar company".

However, because of this issue the accuracy of the results cannot be guaranteed, and I felt it would be dishonest to publish them, so I have not introduced them this time. (I had even prepared a harsh example like "Livesense + Overtime + Hard => ??" ...!)

This problem is not specific to Japanese, and various studies on it seem to be under way around the world. Among them, there appears to be a study (Reference) reporting that the results change considerably when word2vec is trained with dependency structure taken into account, and I would like to pursue such efforts in the future.