My name is @naotaka1128, and I am in charge of day 20 of the Livesense Advent Calendar 2016. I currently work as a data analyst for Job Change Conference, a word-of-mouth job-change review service.
Job Change Conference is Japan's largest job-change review service, with millions of company reviews collected. At the moment we only display the reviews and their scores, but going forward we would like to analyze the reviews with natural language processing and provide useful information that has not been available before.
This time, as a first step, I will analyze the reviews with natural language processing technologies such as word2vec and doc2vec, and use them to classify companies.
## word2vec
Recently, a natural language processing technique called word2vec has become a hot topic. As many of you probably know, it uses a large amount of text to express words as numeric vectors, which makes calculations like the following possible between words:
- King - Man + Woman = Queen
- Paris - France + Japan = Tokyo
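With gensim, analogy queries like the ones above can be written roughly as follows. This is a minimal sketch assuming a word2vec model trained elsewhere; the model path is hypothetical.
# A minimal sketch of the analogy queries above.
# Assumes a word2vec model trained elsewhere; the path is hypothetical.
from gensim.models import Word2Vec

w2v = Word2Vec.load('./data/word2vec.model')
# King - Man + Woman => expected to come out as Queen
print(w2v.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))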
## doc2vec
There is also doc2vec, which extends word2vec and enables the same kinds of calculations on whole documents. Roughly speaking, it is like adding up the word vectors acquired by word2vec.
Because doc2vec turns documents into vectors, it can calculate the similarity between different documents. I will treat each company's reviews as a single document and analyze them with doc2vec to examine the relationships between companies.
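To make the "adding up word vectors" intuition concrete, here is a naive illustration. Note that this is only the intuition: doc2vec actually learns document vectors jointly during training rather than by averaging.
# Naive illustration only: doc2vec does NOT simply average word vectors,
# but the intuition "document vector ~ combination of word vectors" looks like this.
import numpy as np

def naive_doc_vector(words, w2v):
    """Average the vectors of the known words as a crude document representation."""
    vecs = [w2v[w] for w in words if w in w2v]
    return np.mean(vecs, axis=0) if vecs else None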
We will follow the flow below, which is almost the same as in "Classifying ramen shops with natural language processing".
# Load the review data
from io_modules import load_data  # self-made DB access library
rows = load_data(LOAD_QUERY, KUCHIKOMI_DB)  # [company name, review text]

# Extract word stems with the stems function from the reference article
from utils import stems  # implementation taken almost as-is from the reference article
companies = [row[0] for row in rows]
docs = [stems(row[1]) for row in rows]
"""
I am making the following data
companies = ['Black Company Co., Ltd.', 'Rewarding Co., Ltd.', ...]
docs = [
['Rewarding', 'not enough', 'overtime', 'very much', 'Many', ...
['Midnight overtime', 'Natural', 'Spicy', 'I want to die', 'Impossible', ...
...
]
"""
Now, here is where the real natural language processing begins... or so I would like to say, but since we just call a library, there is not much to do. The computation was also quick and finished without a hitch.
# Load the library
from gensim import models

# Register the reviews with gensim.
# LabeledListSentence is the extension class from the reference article,
# which attaches the company name as a label to each review document.
sentences = LabeledListSentence(docs, companies)

# doc2vec training settings
# alpha: learning rate / min_count: ignore words that appear fewer than X times
# size: number of vector dimensions / iter: number of iterations / workers: number of parallel workers
model = models.Doc2Vec(alpha=0.025, min_count=5,
                       size=100, iter=20, workers=4)

# Prepare for doc2vec (build the vocabulary)
model.build_vocab(sentences)

# You can also force in word vectors pre-trained on Wikipedia:
# model.intersect_word2vec_format('./data/wiki/wiki2vec.bin', binary=True)

# Run the training
model.train(sentences)

# Save the model
model.save('./data/doc2vec.model')

# Once trained, the model can be loaded back from the file:
# model = models.Doc2Vec.load('./data/doc2vec.model')

# The document order may change during training, so fetch the company list back from the model
companies = model.docvecs.offset2doctag
The implementation of LabeledListSentence is as follows:
# Reference article: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
class LabeledListSentence(object):
    def __init__(self, words_list, labels):
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield models.doc2vec.LabeledSentence(words, ['%s' % self.labels[i]])
Well, the model was built in no time.
If the word2vec side is inaccurate, the doc2vec results will inevitably be poor as well, so let's check the accuracy while playing around with word2vec.
First, let's start with "overtime", a word that comes up constantly in job-change reviews.
# model.most_similar(positive=[word]) returns the words most similar to the given word
>> model.most_similar(positive=['overtime'])
[('overtime work', 0.8757208585739136),
('Service overtime', 0.8720364570617676),
('Wage theft', 0.7500427961349487),
('Overtime fee', 0.6272672414779663),
('reverberation', 0.6267948746681213),
('Holiday work', 0.5998174548149109),
('Long working hours', 0.5923150777816772),
('workload', 0.5819833278656006),
('Overtime', 0.5778118371963501),
('overtime pay', 0.5598958730697632)]
Similar words line up nicely...!
It is impressive that the model picks up abbreviations such as "wage theft" and even typos such as "reverberation". The computer has understood the concept of "overtime"! It was quite a moving moment.
Since we want users to make positive job changes every day, let's check a positive word as well.
>> model.most_similar(positive=['Rewarding'])
[('Kai', 0.9375230073928833),
('The real thrill', 0.7799979448318481),
('interesting', 0.7788150310516357),
('Interesting', 0.7710426449775696),
('pleasant', 0.712959885597229),
('reason to live', 0.6919904351234436),
('Interesting', 0.6607719659805298),
('joy', 0.6537446975708008),
('pride', 0.6432669162750244),
('Boring', 0.6373245120048523)]
There is a slightly worrying word at the very end, but overall this looks good. The model seems to understand the words themselves, so the next step is to add and subtract words.
Looking through job-change reviews, content about how easy it is for women to work is popular, so let's try something like "single woman - female + male = ?".
First, check the model's basic understanding of the individual words.
# Words similar to "Female"
>> model.most_similar(positive=['Female'])
[('Female employee', 0.8745297789573669),
('a working woman', 0.697405219078064),
('Single woman', 0.6827554106712341),
('Woman', 0.5963315963745117)]
# Words similar to "male"
>> model.most_similar(positive=['male'])
[('Management', 0.7058243751525879),
('Active', 0.6625881195068359),
('Special treatment', 0.6411184668540955),
('Preferential treatment', 0.5910355448722839)]
# Words similar to "Single woman"
>> model.most_similar(positive=['Single woman'])
[('Female employee', 0.7283456325531006),
('Single mother', 0.6969124674797058),
('Unmarried', 0.6945561170578003),
('Female', 0.6827554106712341)]
The basics seem fine, so let's run the addition and subtraction.
# Single woman - Female + male = ?
# model.most_similar(positive=[words to add], negative=[words to subtract])
>> model.most_similar(positive=['Single woman', 'male'], negative=['Female'])
[('Unmarried', 0.665600597858429),
('Management', 0.6068357825279236),
('Having children', 0.58555006980896),
('Boys', 0.530462384223938),
('Special treatment', 0.5190619230270386)]
Despite some questionable entries, the feminine words such as "Female employee" and "Single mother" have disappeared from "Single woman", and "Unmarried" is estimated to be the most similar word.
word2vec seems to be working correctly as it is, so let's move on to classifying companies.
Next, let's examine the relationships between companies.
First, a sanity check with an easy case: a company that has many affiliated companies.
# model.docvecs.most_similar(positive=[base company ID])
# ID 53: Recruit Holdings
>> model.docvecs.most_similar(positive=[53])
[('Recruit Lifestyle Co., Ltd.', 0.9008421301841736),
('Recruit Jobs Co., Ltd.', 0.8883105516433716),
('Recruit Carrier Co., Ltd.', 0.8839867115020752),
('Recruit Housing Company Co., Ltd.', 0.8076469898223877),
('Recruit Communications Co., Ltd.', 0.7945607900619507),
('Career Design Center Co., Ltd.', 0.7822821140289307),
('En-Japan Co., Ltd.', 0.782017707824707),
('Recruit Marketing Partners Co., Ltd.', 0.7807818651199341),
('CyberAgent, Inc.', 0.7434782385826111),
('Quick Co., Ltd.', 0.7397039532661438)]
It may seem too easy, but the companies that come out as similar to Recruit Holdings are indeed Recruit group companies.
That looks fine, so let's also look at some well-known companies.
# ID 1338: DeNA
>> model.docvecs.most_similar(positive=[1338])
[('GREE, Inc.', 0.8263522386550903),
('CyberAgent, Inc.', 0.8176108598709106),
('Drecom Co., Ltd.', 0.7977319955825806),
('Speee, Inc.', 0.787316083908081),
('CYBIRD Co., Ltd.', 0.7823044061660767),
('Dwango Co., Ltd.', 0.767551064491272),
('Yahoo Japan Corporation', 0.7610974907875061),
('KLab Co., Ltd.', 0.7593647837638855),
('Gloops Co., Ltd.', 0.7475718855857849),
('NHN\u3000comico Co., Ltd.', 0.7439380288124084)]
DeNA, which has been a hot topic recently, comes out as similar to GREE. It was apparently judged to be game-related, and CyberAgent and Drecom also appear.
This looks pretty good. Since looking only at Web companies could bias the impression, let's also check companies in completely different industries.
# ID 862: Honda
>> model.docvecs.most_similar(positive=[862])
[('Toyota Motor Corporation', 0.860333263874054),
('Mazda Corporation Inc', 0.843244194984436),
('Denso Co., Ltd.', 0.8296780586242676),
('Fuji Heavy Industries Ltd.', 0.8261093497276306),
('Hino Motors Co., Ltd.', 0.8115691542625427),
('Nissan Motor Co., Ltd', 0.8105560541152954),
('Daihatsu Motor Co., Ltd.', 0.8088374137878418),
('Aisin Seiki Co., Ltd.', 0.8074800372123718),
('Honda R & D Co., Ltd.', 0.7952905893325806),
('Toyota Industries Corporation', 0.7946352362632751)]
# ID 38: Sony
>> model.docvecs.most_similar(positive=[38])
[('Panasonic Corporation', 0.8186650276184082),
('Toshiba Corporation', 0.7851587533950806),
('OMRON Corporation', 0.7402874231338501),
('NEC', 0.7391767501831055),
('Nikon Corporation', 0.7331269383430481),
('Sony Global Manufacturing & Operations Corporation', 0.7183523178100586),
('Taiyo Yuden Co., Ltd.', 0.7149790525436401),
('Sharp Corporation', 0.7115868330001831),
('Pioneer Corporation', 0.7104746103286743),
('Canon Inc', 0.7103182077407837)]
# ID 1688: McKinsey (consulting firm)
>> model.docvecs.most_similar(positive=[1688])
[('Accenture Co., Ltd.', 0.7885801196098328),
('Boston Consulting Group Co., Ltd.', 0.7835338115692139),
('Goldman Sachs Securities Co., Ltd.', 0.7507193088531494),
('Deloitte Tohmatsu Consulting LLC', 0.7278151512145996),
('SIGMAXYZ Co., Ltd.', 0.6909163594245911),
('PwC Advisory LLC', 0.6522221565246582),
('Link and Motivation Co., Ltd.', 0.6289964914321899),
('Morgan Stanley MUFG Securities Co., Ltd.', 0.6283067464828491),
('EY Advisory Co., Ltd.', 0.6275663375854492),
('ABeam Consulting Co., Ltd.', 0.6181442737579346)]
The results generally look fine.
Since the similarity between companies can be calculated this way (that is, distances can be calculated), the following analyses become easy, as sketched after this list:
- Grouping companies with clustering methods such as K-means
- Visualizing the distribution of companies with methods such as multidimensional scaling
Both are very easy to implement with scikit-learn.
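A minimal sketch of both analyses, assuming the model and companies objects from above; the cluster count and other parameters are illustrative.
# A minimal sketch: K-means clustering and MDS on the learned company vectors.
# Assumes `model` and `companies` from above; parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

# One vector per company, taken from the trained doc2vec model
vectors = np.array([model.docvecs[c] for c in companies])

# Group the companies into 10 clusters
labels = KMeans(n_clusters=10, random_state=0).fit_predict(vectors)

# Reduce to 2D coordinates for visualizing the company distribution
coords = MDS(n_components=2, random_state=0).fit_transform(vectors)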
I actually did visualize the distribution with multidimensional scaling this time, but writing it all up would make this article very long, so I would like to cover it some other time.
Like word2vec, doc2vec allows addition and subtraction, this time on documents. Let's give it a try.
As we saw earlier, the companies similar to Recruit Holdings were Recruit group companies.
# ID 53: companies similar to Recruit Holdings (repost)
>> model.docvecs.most_similar(positive=[53])
[('Recruit Lifestyle Co., Ltd.', 0.9008421301841736),
('Recruit Jobs Co., Ltd.', 0.8883105516433716),
('Recruit Carrier Co., Ltd.', 0.8839867115020752),
('Recruit Housing Company Co., Ltd.', 0.8076469898223877),
('Recruit Communications Co., Ltd.', 0.7945607900619507),
('Career Design Center Co., Ltd.', 0.7822821140289307),
('En-Japan Co., Ltd.', 0.782017707824707),
('Recruit Marketing Partners Co., Ltd.', 0.7807818651199341),
('CyberAgent, Inc.', 0.7434782385826111),
('Quick Co., Ltd.', 0.7397039532661438)]
Here, "Job change information DODA", "Part-time job information an", etc. are operated. Let's add the intelligence of a major human resources company.
# model.docvecs.most_similar(positive=[Base company ID,Add more than one])
# 「ID 53:Recruit Holdings "+ 「ID 110:Intelligence "= ?
>> model.docvecs.most_similar(positive=[53, 110])
[('Recruit Carrier Co., Ltd.', 0.888693630695343),
('Recruit Jobs Co., Ltd.', 0.865821123123169),
('Recruit Lifestyle Co., Ltd.', 0.8580507636070251),
('Career Design Center Co., Ltd.', 0.8396339416503906),
('En-Japan Co., Ltd.', 0.8285592794418335),
('Mynavi Corporation', 0.7874248027801514),
('Quick Co., Ltd.', 0.777060866355896),
('Recruit Housing Company Co., Ltd.', 0.775804877281189),
('CyberAgent, Inc.', 0.7625365257263184),
('Neo Career Co., Ltd.', 0.758436381816864)]
The results can be read as follows: with Intelligence added, HR companies such as Recruit Career, En-Japan, Mynavi, and Neo Career rise in the ranking, while non-HR subsidiaries such as Recruit Lifestyle and Recruit Housing Company fall.
Admittedly this is a somewhat arbitrary, easy-to-understand example, but it seems fair to judge that things are working as intended.
In this article, we achieved the following using the reviews from Job Change Conference:
- With word2vec, we had the machine understand the concepts of the words appearing in reviews, calculated word similarities, and performed addition and subtraction on words.
- With doc2vec, we had the machine understand an outline of each company, calculated company similarities, and performed addition and subtraction on companies.
Going forward, I would like to try building technology that can perform calculations such as "word + word => similar companies" (for example: rewarding + growth => Livesense), so that users can search for [jobs](https://career.jobtalk.jp/) at companies whose culture suits them.
However, there is currently one fatal issue, so I will briefly introduce it at the end. Below is an easy-to-understand example.
# Which words are similar to "black"?
>> model.most_similar(positive=['black'])
[('Black company', 0.8150135278701782),
('white', 0.7779906392097473),
('White company', 0.6732245683670044),
('Black company', 0.5990744829177856),
('Black black', 0.5734715461730957),
('Famous', 0.563334584236145),
('clean', 0.5561092495918274),
('gray', 0.5449624061584473),
('Nightless castle', 0.5446360111236572),
('Religious groups', 0.5327660441398621)]
As this example shows, the simple method introduced here treats "black" and "white" as similar words.
Words used in the same context are treated as the same, so their polarity cannot be determined. The machine appears to recognize things as follows:
- It understands that we are talking about overtime
- It cannot tell whether there is a lot of overtime or only a little
In fact, in order to output "word + word => similar companies" results, I modified and extended the gensim library.
However, I felt it would be dishonest to present it while this issue remains and the accuracy of the results cannot be guaranteed, so I did not introduce it this time. (I had prepared a punchy example like "Livesense + overtime + hard => ??"...!)
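For what it's worth, the basic idea can be sketched without modifying gensim: add up the word vectors and search the company (document) vectors by cosine similarity. This is a minimal sketch with illustrative names, not the extension I actually built, and the accuracy caveats above apply here as well.
# A rough sketch of "word + word => similar companies" without modifying gensim.
# Illustrative only; the polarity problem described above applies here too.
import numpy as np

def similar_companies(model, words, topn=10):
    """Add up the word vectors and return the closest company (document) vectors."""
    query = np.sum([model[w] for w in words], axis=0)
    doctags = model.docvecs.offset2doctag
    vectors = np.array([model.docvecs[tag] for tag in doctags])
    # cosine similarity between the query and every company vector
    sims = vectors.dot(query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    best = np.argsort(-sims)[:topn]
    return [(doctags[i], float(sims[i])) for i in best]

# e.g. similar_companies(model, ['Rewarding', 'growth'])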
This problem is not specific to Japanese, and various studies around the world seem to be tackling it. Among them, there is research showing that the results change considerably when word2vec is trained with dependency structure taken into account (see the reference), and I would like to try such approaches in the future.