[PYTHON] What I learned by analyzing reviews on a job-change site

Introduction

I am currently researching the question "How can I enjoy my work?" Collecting company reviews (word-of-mouth) seemed like a good source of hints, so I decided to try analyzing that data.

Environment

Scraping

Before implementing anything, review the precautions around scraping: List of precautions for web scraping --Qiita

Prepare the necessary tools by referring to the following article: Complete automatic operation of Chrome with Python + Selenium --Qiita

Collect the reviews from the job-change site. The code is here.
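The linked script is not reproduced in this article, so here is a minimal sketch of the scraping step, assuming Selenium 4 with headless Chrome; the URL and the CSS selector are placeholders, not the actual site, and the output matches the scraping.csv used below.

# A minimal sketch of the scraping step (not the original script).
# URL and selector are placeholders -- adapt them to the actual review site,
# and respect its terms of service and robots.txt.
import csv
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.com/company/reviews"   # placeholder
OUTPUT_FILE_PATH = "./scraping.csv"

options = webdriver.ChromeOptions()
options.add_argument("--headless")            # run Chrome without a window

driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    time.sleep(3)                             # crude wait for the page to render

    # Each review is assumed to live in an element with the class "review-text".
    reviews = driver.find_elements(By.CSS_SELECTOR, ".review-text")

    with open(OUTPUT_FILE_PATH, "w", newline="") as f:
        writer = csv.writer(f)
        for review in reviews:
            writer.writerow([review.text])
finally:
    driver.quit()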

Morphological analysis

Prepare the necessary tools by referring to the following article: Use mecab on Mac. --Qiita

Run the code below to split the reviews into words with their parts of speech.

KeitaisoKaiseki.py


# coding: utf-8
import MeCab

mecab = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

INPUT_FILE_PATH = "./scraping.csv"
OUTPUT_FILE_PATH = "./mecab.txt"

with open(INPUT_FILE_PATH) as f:
    text = f.read()

mecab.parse('')  # work around node.surface coming back empty (string GC issue)
node = mecab.parseToNode(text)

with open(OUTPUT_FILE_PATH, mode='w') as f:
    while node:
        # Get the word itself
        word = node.surface
        # Get the part of speech (feature[0]) and its subtype (feature[1])
        features = node.feature.split(",")
        f.write('{0} , {1} , {2}\n'.format(word, features[0], features[1]))

        # Advance to the next word
        node = node.next

Left as is, the output is dominated by meaningless tokens such as particles, so keep only the parts of speech that are likely to carry meaning (nouns, verbs, and adjectives, excluding pronouns):

$ grep -e "名詞" -e "動詞" -e "形容詞" mecab.txt | grep -v "代名詞" | cut -d',' -f 1 > mecab_edited.txt
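If you prefer to stay in Python, a rough equivalent of the grep/cut pipeline above (just a sketch, assuming the "word , POS , subtype" lines written by KeitaisoKaiseki.py) could look like this:

# A rough Python equivalent of the grep/cut pipeline (a sketch, not the original).
# Keeps nouns, verbs and adjectives, drops pronouns, and writes one word per line.
KEEP_POS = ("名詞", "動詞", "形容詞")   # noun, verb, adjective
EXCLUDE_DETAIL = "代名詞"               # pronoun

with open("./mecab.txt") as fin, open("./mecab_edited.txt", "w") as fout:
    for line in fin:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        word, pos, detail = parts[0], parts[1], parts[2]
        if pos in KEEP_POS and detail != EXCLUDE_DETAIL:
            fout.write(word + "\n")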

Word cloud

The word cloud visualizes how frequently words appear in the tokenized reviews. Implement it by referring to the following article.

Text mining with Python (2) Visualization with Word Cloud --Qiita

WordCloud.py


# coding: utf-8
from wordcloud import WordCloud

# Path to a Japanese-capable font (Hiragino Kaku Gothic on macOS)
FONT_PATH = "/System/Library/Fonts/ヒラギノ角ゴシック W9.ttc"

INPUT_FILE_PATH = "./mecab_edited.txt"
OUTPUT_FILE_PATH = "./wordcloud.png"

with open(INPUT_FILE_PATH) as f:
    text = f.read()

# Set any words you want to remove from the output image
stop_words = []

wordcloud = WordCloud(background_color="white",
    font_path=FONT_PATH,
    width=800, height=600,
    stopwords=set(stop_words)).generate(text)

wordcloud.to_file(OUTPUT_FILE_PATH)

Since there is still noise, repeat the cycle of "register unnecessary words as stop words (see Removing stop words: https://qiita.com/Hironsan/items/2466fe0f344115aff177#%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E3%81%AE%E9%99%A4%E5%8E%BB) → regenerate the word cloud" until you get a result you can actually analyze.
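To pick stop word candidates more quickly, one option (not in the original article, just a small sketch) is to print the most frequent tokens and scan them by eye:

# A small helper sketch (not from the original article): list the most frequent
# tokens in mecab_edited.txt so stop word candidates are easy to spot.
from collections import Counter

with open("./mecab_edited.txt") as f:
    words = f.read().split()

for word, count in Counter(words).most_common(30):
    print(count, word)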

This produced the image below, generated from the reviews of a certain IT company.

wordcloud.png

Even looking at this image it is hard to read the company's tendencies, so I will focus on the characteristic word "motivation" and analyze it further. Is motivation really high? Or is it low?

Analysis

First, extract the reviews that contain the string "motivation".

$ grep "motivation" scraping.csv > scraping_motivation.csv

I have already cut into my sleep far too much, so from here on I will use an external service: AI Text Mining by User Local.

Dependency analysis

1_名詞⇔形容詞.png (noun ⇔ adjective)

2_名詞⇔動詞.png (noun ⇔ verb)

3_名詞⇔名詞.png (noun ⇔ noun)
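The figures above come from the external service. If you wanted to reproduce a similar dependency analysis locally instead (not what this article did), a minimal sketch with spaCy + GiNZA might look like this:

# A minimal local alternative to the external service (a sketch, not what the
# article actually used): dependency parsing with spaCy + GiNZA.
# Requires: pip install ginza ja_ginza
import spacy

nlp = spacy.load("ja_ginza")

with open("./scraping_motivation.csv") as f:
    reviews = f.read().splitlines()

for review in reviews[:10]:          # look at the first few reviews only
    doc = nlp(review)
    for token in doc:
        # word, its dependency relation, and the word it depends on
        print(token.text, token.dep_, token.head.text)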

What I learned by analyzing reviews on the job-change site

Dependency parsing around "motivation" showed positive words such as "high" and "increase" ranking high in frequency, so the finding was that **employees of this company seem to be highly motivated**.

In conclusion

Although I am more or less an amateur, I tried my hand at data analysis by imitating what I had seen. In the end I could not reach the information I actually wanted, but I did experience how visualizing data can lead to new discoveries. To really achieve the goal, I would probably have to design the data to collect and the analysis method as a whole from the start. (That sounds hard...)

"How to enjoy work?" → "Highly motivated" → Why?

That said, it seems difficult to derive every answer from data alone, so isn't the best shortcut to treat data as a supporting tool while repeating cycles of hypothesis testing with users? Yes, like design thinking.
