Introduction

I am currently doing research on the theme "How can I enjoy my work?" I figured that company reviews might offer some hints, so I will collect them and try analyzing the data!

Environment

macOS Mojave
Python 3.7.4
Google Chrome 79.0.3945.79
ChromeDriver 79.0.3945.36
selenium 3.141.0
mecab-python3 0.996.2

Scraping

Review the precautions for scraping before implementing anything. → List of precautions for web scraping - Qiita

Refer to the following article to prepare the necessary tools. → Complete automatic operation of Chrome with Python + Selenium - Qiita

Collect reviews from job change sites. The code is here
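Since the linked code is not reproduced here, the following is only a minimal sketch of what such a collection step could look like with Selenium 3. The URL, the CSS selector `.review-text`, and the helper names `collect_reviews` and `save_reviews` are all assumptions made for illustration; the real site structure will differ.

```python
import csv
import time


def collect_reviews(driver, urls, selector, delay_sec=1.0):
    """Visit each URL and return the text of every element matching selector.

    driver is expected to be a Selenium 3 WebDriver (or any object that
    exposes get() and find_elements_by_css_selector()).
    """
    reviews = []
    for url in urls:
        driver.get(url)
        for element in driver.find_elements_by_css_selector(selector):
            text = element.text.strip()
            if text:
                reviews.append(text)
        # Pause between page loads so the site is not put under load
        time.sleep(delay_sec)
    return reviews


def save_reviews(reviews, path):
    """Write one review per row to a CSV file."""
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for review in reviews:
            writer.writerow([review])


def main():
    # Imported here so the helpers above can be used without a browser
    from selenium import webdriver

    # Hypothetical values: replace with the real page URLs and CSS selector
    urls = ["https://example.com/company/reviews?page=1"]
    selector = ".review-text"

    driver = webdriver.Chrome()
    try:
        save_reviews(collect_reviews(driver, urls, selector), "./scraping.csv")
    finally:
        driver.quit()
```

Calling `main()` would write the collected text to `./scraping.csv`, the file name the later steps read from. The delay between page loads is there to follow the scraping precautions mentioned above.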

Morphological analysis

Refer to the following article to prepare the necessary tools. → Use mecab on Mac. - Qiita

Run the code below to split the reviews into individual words.

`KeitaisoKaiseki.py`


# coding: utf-8
import MeCab

mecab = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

INPUT_FILE_PATH = "./scraping.csv"
OUTPUT_FILE_PATH = "./mecab.txt"

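with open(INPUT_FILE_PATH, encoding='utf-8') as f:
    text = f.read()

# mecab.parse('')  # Workaround that keeps the input string from being garbage-collected
node = mecab.parseToNode(text)

# Open the output file once instead of reopening it for every word
with open(OUTPUT_FILE_PATH, mode='a', encoding='utf-8') as f:
    while node:
        # Get the word (surface form)
        word = node.surface
        # Get the part-of-speech field from the comma-separated feature string
        pos = node.feature.split(",")[1]
        f.write('{0} , {1}\n'.format(word, pos))

        # Advance to the next word
        node = node.next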

Left as is, the output is full of words that carry little meaning on their own, such as particles, so extract only the parts of speech that seem meaningful as words.

$ grep -e "名詞" -e "動詞" -e "形容詞" mecab.txt | grep -v "代名詞" | cut -d',' -f 1 > mecab_edited.txt
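For readers without a shell at hand, the same filtering can be approximated in Python. This is a sketch under assumptions, not an exact port: the names `extract_words` and `filter_file` are made up for illustration, and unlike `grep` it matches only against the part-of-speech field rather than the whole line. It assumes MeCab emits Japanese tags (名詞 = noun, 動詞 = verb, 形容詞 = adjective, 代名詞 = pronoun).

```python
KEEP = ("名詞", "動詞", "形容詞")  # noun, verb, adjective
DROP = ("代名詞",)                 # pronoun


def extract_words(lines):
    """Return the words whose part-of-speech field matches KEEP but not DROP."""
    words = []
    for line in lines:
        line = line.rstrip("\n")
        # Each line looks like "word , pos"; split on the last separator
        word, sep, pos = line.rpartition(" , ")
        if not sep:
            continue  # skip lines that do not have the expected shape
        if any(k in pos for k in KEEP) and not any(d in pos for d in DROP):
            words.append(word)
    return words


def filter_file(src_path, dst_path):
    """Read the MeCab output file and write one kept word per line."""
    with open(src_path, encoding="utf-8") as src:
        words = extract_words(src)
    with open(dst_path, mode="w", encoding="utf-8") as dst:
        dst.write("\n".join(words) + "\n")
```

For example, `filter_file("./mecab.txt", "./mecab_edited.txt")` would produce the input file for the word cloud step. Like the `grep` version, the substring match on 動詞 also lets tags such as 助動詞 through, so some noise remains.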

Word cloud

A word cloud is used to visualize how often each word appears in the cleaned-up reviews. Implement it by referring to the following article.

Text mining with Python (2) Visualization with Word Cloud - Qiita

`WordCloud.py`


# coding: utf-8
from wordcloud import WordCloud

FONT_PATH = "/System/Library/Fonts/ヒラギノ角ゴシック W9.ttc"

INPUT_FILE_PATH = "./mecab_edited.txt"
OUTPUT_FILE_PATH = "./wordcloud.png"

with open(INPUT_FILE_PATH, encoding="utf-8") as f:
    text = f.read()

# Placeholder entries: replace them with the words to remove from the output image
stop_words = ["From the output image", "I want to remove", "Word", "Please set"]

wordcloud = WordCloud(
    background_color="white",
    font_path=FONT_PATH,
    width=800,
    height=600,
    stopwords=set(stop_words),
).generate(text)

wordcloud.to_file(OUTPUT_FILE_PATH)

Since noise still remains, repeat the cycle "register unnecessary words as stop words (https://qiita.com/Hironsan/items/2466fe0f344115aff177#%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E3%81%AE%E9%99%A4%E5%8E%BB) → generate the word cloud" until you get a result you can analyze.
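One way to speed up this loop is to list the most frequent words as text before regenerating the image, so stop-word candidates are easier to spot. The sketch below is an assumption about how that could be done with the standard library; `top_words` is a hypothetical helper, and its simple whitespace split may count words slightly differently from the tokenization inside `WordCloud`.

```python
from collections import Counter


def top_words(text, stop_words=(), n=20):
    """Return the n most frequent whitespace-separated words, skipping stop words."""
    stop = set(stop_words)
    counts = Counter(word for word in text.split() if word not in stop)
    return counts.most_common(n)
```

Passing the contents of `mecab_edited.txt` together with the current `stop_words` list returns `(word, count)` pairs; any frequent word that says nothing about the company can then be added to `stop_words` before the next run.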

This produced the following image, generated from the reviews of a certain IT company.

The image alone does not make the company's tendencies clear, so I will focus on the characteristic word "motivation" and analyze it further. Is motivation really high there, or is it low?

Analysis

First, extract the reviews that contain the string "motivation".

$ grep "motivation" scraping.csv > scraping_motivation.csv
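The same extraction can also be written in Python, which may be convenient when the rest of the pipeline is already a script. This is a minimal sketch; `filter_reviews` and `filter_review_file` are hypothetical names, and the match is a plain substring test per line, whereas `grep` interprets its pattern as a regular expression.

```python
def filter_reviews(lines, keyword):
    """Keep only the lines that contain keyword as a plain substring."""
    return [line for line in lines if keyword in line]


def filter_review_file(src_path, dst_path, keyword):
    """Copy the lines of src_path that contain keyword to dst_path.

    Returns the number of lines written.
    """
    with open(src_path, encoding="utf-8") as src:
        matched = filter_reviews(src, keyword)
    with open(dst_path, mode="w", encoding="utf-8") as dst:
        dst.writelines(matched)
    return len(matched)
```

For example, `filter_review_file("./scraping.csv", "./scraping_motivation.csv", "motivation")` mirrors the command above. Both versions work line by line, so a review that spans several lines is only partly kept.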

I have already cut too far into my sleep, so from here on I will use an external service. → AI Text Mining by User Local

Dependency analysis

1_名詞⇔形容詞.png

2_名詞⇔動詞.png

3_名詞⇔名詞.png

What I learned by analyzing the reviews of the job change site

Dependency analysis of "motivation" showed positive words such as "high" and "increase" ranking near the top in frequency, which suggests that **employees of this company seem to be highly motivated**.

In conclusion

Although I am close to a complete amateur, I tried my hand at data analysis by imitating what others do. I did not reach the information I was after, but I did experience how visualizing data can lead to new discoveries. To achieve the goal, I probably need to design the whole process up front: what data to collect and how to analyze it. (It looks difficult ...)

"How to enjoy work?" → "Highly motivated" → Why?

However, it seems hard to derive every answer from data alone. Isn't the best shortcut to use data as a supporting tool while repeatedly testing hypotheses with users? Yes, just like design thinking.

[PYTHON] I found out by analyzing the reviews of the job change site! ??