[Python] [word2vec] Let's visualize the results of natural language processing of company reviews

Introduction

This article is a continuation of "[Job Change Conference] Try to classify companies by processing reviews in natural language with word2vec".

Last time, I wrote that processing Job Change Conference reviews with word2vec let me look up similar companies and similar words; this time, I will visualize those results.

Caution

As I wrote in the previous article, this method has the drawback that "I can tell a review is talking about overtime, but not whether there is a lot of it or a little."

This visualization does not fix that defect, so I hope you will treat it as a visualization sample.

Also, while the last article was written for my company's Advent calendar, this one I wrote as an individual, so the content of this article has nothing to do with the views of the organization I belong to.


What to use

- gensim (word2vec / Doc2Vec)
- scikit-learn (PCA, K-means, t-SNE)
- SciPy (hierarchical clustering)
- matplotlib
- pandas
- MeCab (morphological analysis)

Load the model learned last time

Load the model saved in [section 2 of the previous article](http://qiita.com/naotaka1128/items/2c4551abfd40e43b0146#2-gensim-%E3%81%A7-doc2vec-%E3%81%AE%E3%83%A2%E3%83%87%E3%83%AB%E6%A7%8B%E7%AF%89).

from gensim import models

model = models.Doc2Vec.load('./data/doc2vec.model')

Play with words ① Try to draw a scatter plot

I defined a method for drawing the scatter plot as follows.

Word vectors are usually trained with 100 or 300 dimensions, so we compress them down to two dimensions before visualizing.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def draw_word_scatter(word, topn=30):
    """Draw a scatter plot of words similar to the input word"""

    # Use gensim word2vec's similar-word lookup:
    # model.most_similar(word, topn=topn)
    words = [x[0] for x in sorted(model.most_similar(word, topn=topn))]
    words.append(word)

    # Get the vector representation of each word. A helper that returns
    # a word's vector (model.calc_vec) is defined based on gensim's
    # most_similar; it is long, so only a sketch is given after this code block.
    vecs = [model.calc_vec(word) for word in words]

    # Scatter plot
    draw_scatter_plot(vecs, words)

def draw_scatter_plot(vecs, tags, clusters=None):
    """Draw a labeled scatter plot from the input vectors"""

    # Dimensionality reduction with scikit-learn's PCA
    pca = PCA(n_components=2)
    coords = pca.fit_transform(vecs)

    # Visualization with matplotlib
    fig, ax = plt.subplots()
    x = [v[0] for v in coords]
    y = [v[1] for v in coords]

    # Color by cluster if a cluster is given for each point
    # (the error handling here is rough)
    if clusters is not None:
        ax.scatter(x, y, c=clusters)
    else:
        ax.scatter(x, y)

    for i, txt in enumerate(tags):
        ax.annotate(txt, (coords[i][0], coords[i][1]))
    plt.show()
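The article defers the implementation of model.calc_vec to the end, and that part is not included here. As a stand-in, below is a minimal sketch of such a helper written as a plain function; the vocabulary check and the infer_vector fallback are my assumptions (gensim 1.x-era API), not the author's most_similar-based implementation.

# A minimal sketch of a calc_vec-style helper (an assumption; the author's
# actual implementation is not shown in this article)
def calc_vec(model, word):
    """Return the trained vector for a word, falling back to inference."""
    if word in model.wv.vocab:  # gensim 1.x-era vocabulary check
        return model.wv[word]
    # Out-of-vocabulary: infer a vector for the single-word "document"
    return model.infer_vector([word])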

Now that everything is ready, let's draw the scatter plot.

# "overtime"Visualize words that resemble
draw_word_scatter('overtime', topn=40)
残業.png

The result is something I can't look at without tears.

The cluster in the middle where "coming home in the morning," "morning," "last train," and "overtime work" huddle together is especially miserable. Even scarier, "sleep" is plotted farthest from that cluster. I can't help feeling the melancholy of office workers and the danger of death from overwork...

That was a little bleak, so let's try a positive word as well.

# "Rewarding"Visualize words that resemble
draw_word_scatter('Rewarding')
やりがい.png

Completely different from the previous scatter plot...! Pride, rewarding work, and giving people dreams; that's more like it. By the way, we are looking for comrades to take on rewarding work together.

Play with words ② Try to draw a dendrogram

If you run a website, you sometimes want to group words.

Personally, I think one reason WELQ and MERY were overwhelmingly strong in SEO was their proper layering and grouping of tags. This technique can be used for that kind of thing, and it would also be nice for automatically classifying the inflow keywords of listing ads when building landing pages.

Here, let's draw a dendrogram as a basis for proper stratification and grouping.

import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

def draw_similar_word_dendrogram(word, topn=30):
    """Draw a dendrogram of words similar to the input word"""

    # Same as in draw_word_scatter (duplicated here for the Qiita article)
    words = [x[0] for x in sorted(model.most_similar(word, topn=topn))]
    words.append(word)
    vecs = [model.calc_vec(word) for word in words]

    # Visualization using SciPy functions
    # (based on the code in the book "Python Machine Learning")
    df = pd.DataFrame(vecs, index=words)
    row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
    dendrogram(row_clusters, labels=words)
    plt.show()

Let's draw it.

# "overtime"Write a tree diagram of words similar to
draw_similar_word_dendrogram('overtime')
残業樹形図.png

Sorry the labels are small, but the dendrogram came out as is. Here, too, "coming home in the morning," "morning," and "last train" line up next to each other. Go home early...

You can form groups by cutting this dendrogram at an appropriate height, as in the sketch below.
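For example, a minimal sketch (my addition, not the article's code) using SciPy's fcluster, assuming the words and row_clusters computed in draw_similar_word_dendrogram are at hand; the cutoff height 1.0 is a hypothetical value:

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a hypothetical height; words whose merge distance
# is below the cutoff fall into the same group
groups = fcluster(row_clusters, t=1.0, criterion='distance')
for word, group_id in zip(words, groups):
    print(group_id, word)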

Play with companies ① Try to draw a scatter plot

Next, let's draw a scatter plot of companies.

Here we plot each company using only the reviews about its **corporate culture**. The aim is to find companies with similar corporate cultures.

To compute vector representations with the already-trained model, use gensim Doc2Vec's infer_vector function. As someone commented on an article the other day, this function is honestly not very accurate.

However, compared with the problems word2vec has with processing company reviews in the first place, I don't think it is a big issue, so I use it as is.
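As an aside (my addition, not from the article): since infer_vector is stochastic, one common mitigation is to increase the number of inference passes. In gensim of this era the keyword was steps (renamed epochs in later versions); some_review_words below is a hypothetical token list.

# More inference passes usually give more stable vectors
vec = model.infer_vector(some_review_words, steps=50)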

First, compute the companies' vector representations. The targets were web-industry companies with at least a certain number of reviews.

# Load the model
model = models.Doc2Vec.load('./data/doc2vec.model')

# Read company and review data from the DB
companies = connect_mysql(QUERY_COMPANIES, DB_NAME)
reviews = connect_mysql(QUERY_COMPANY_CULTURE, DB_NAME)

# Morphological analysis of the review data
# (utils.stems wraps the MeCab morphological analysis; a sketch follows below)
words = [utils.stems(review) for review in reviews]

# Compute each company's vector representation from its review words
vecs = [model.infer_vector(word) for word in words]
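utils.stems is the author's own helper and its implementation is not shown. A minimal sketch of such a MeCab-based function might look like the following; the part-of-speech filter and the IPAdic feature layout are my assumptions.

import MeCab

tagger = MeCab.Tagger()

def stems(text):
    """Tokenize Japanese text with MeCab and return base forms of content words."""
    node = tagger.parseToNode(text)
    words = []
    while node:
        features = node.feature.split(',')
        # With IPAdic, features[6] is the base form ('*' if unavailable)
        base = features[6] if len(features) > 6 and features[6] != '*' else node.surface
        if features[0] in ('名詞', '動詞', '形容詞'):  # nouns, verbs, adjectives
            words.append(base)
        node = node.next
    return words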

Now that we have calculated the vector representation, let's visualize it.

# Visualize using the method defined above
draw_scatter_plot(vecs, companies)

[Figure: company scatter plot, no cluster coloring (companies_without_clusters.png)]

It's cluttered and hard to read, but the Recruit-group companies stick together at the top and game companies gather at the bottom.

Still, is it really true that GREE and mixi sit in similar positions? There are a few spots like that, which may reflect the accuracy problems of word2vec and infer_vector, plus the distortion of forcing the 100-dimensional vectors into two dimensions.
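One way to gauge that distortion (my aside, not in the article) is to check how much of the original variance the two PCA components actually retain:

# The closer this sum is to 1.0, the less the 2-D projection loses
pca = PCA(n_components=2)
pca.fit(vecs)
print(pca.explained_variance_ratio_.sum())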

Play with companies ② Try to draw a scatter plot with clustering

The scatter plot above was cluttered and hard to read.

If we cluster the companies and color the points by cluster, it becomes a little easier to read, so let's find each company's cluster and then redraw the plot.

import pandas as pd
from sklearn.cluster import KMeans

def kmeans_clustering(tags, vecs, n_clusters):
    """Cluster the vectors with k-means"""
    km = KMeans(n_clusters=n_clusters,
                init='k-means++',
                n_init=20,
                max_iter=1000,
                tol=1e-04,
                random_state=0)
    clusters = km.fit_predict(vecs)
    # Return a Series indexed by tag so it can be passed straight to
    # matplotlib's c= argument in draw_scatter_plot
    return pd.Series(clusters, index=tags)
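The snippet below mentions picking the cluster count with the elbow method; here is a minimal sketch of how that could be run (the range of k is an assumption):

# Plot inertia (within-cluster SSE) against k and look for the "elbow"
distortions = []
for k in range(1, 21):
    km = KMeans(n_clusters=k, init='k-means++', n_init=20, random_state=0)
    km.fit(vecs)
    distortions.append(km.inertia_)
plt.plot(range(1, 21), distortions, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Distortion (inertia)')
plt.show()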

Now run the clustering and visualize with the clusters taken into account.

# The number of clusters is rough (I did look for a reasonable number
# with the elbow method; see the sketch above)
clusters = kmeans_clustering(companies, vecs, 10)

# Draw the scatter plot with cluster information
draw_scatter_plot(vecs, companies, clusters)

[Figure: company scatter plot colored by cluster (companies_with_clusters.png)]

Hmm, it doesn't change much... Still, being able to see where the forced projection into two dimensions distorts things may be an advantage.

Also, Cookpad and DMM landing in similar positions seems likely to stir things up, but since both live in Yebisu Garden Place, maybe they frequent the same restaurants and their corporate cultures really did end up similar... (a painful excuse)

The clustering visualization this time wasn't great, but it might improve somewhat with a different dimensionality-reduction method. It seems worth trying various tweaks, such as swapping the PCA part for scikit-learn's manifold.TSNE, as sketched below.
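As a sketch of that idea (my assumption, not the article's code), the PCA step inside draw_scatter_plot could be swapped out like this:

from sklearn.manifold import TSNE

# Replace the PCA lines with a t-SNE projection; for small point counts,
# a perplexity lower than the default 30 may be needed
tsne = TSNE(n_components=2, random_state=0)
coords = tsne.fit_transform(vecs)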
