I am currently researching the theme "How can I enjoy my work?" I thought that collecting company reviews might give me some hints, so I will try analyzing that data!
Review the precautions for scraping before implementing anything. List of precautions for web scraping - Qiita
Refer to the following article to prepare the necessary tools. Complete automatic operation of Chrome with Python + Selenium - Qiita
Collect reviews from job change sites. The code is here
Refer to the following article to prepare the necessary tools. Using MeCab on Mac - Qiita
Run the code below to split the reviews into words (morphological analysis).
KeitaisoKaiseki.py
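# coding: utf-8
import MeCab

# Dictionary path from the original setup; adjust it to your environment
mecab = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

INPUT_FILE_PATH = "./scraping.csv"
OUTPUT_FILE_PATH = "./mecab.txt"

with open(INPUT_FILE_PATH, encoding="utf-8") as f:
    text = f.read()

# mecab.parse('')  # Prevent the string from being garbage-collected
node = mecab.parseToNode(text)

# Open the output once (overwriting any previous run) instead of once per word
with open(OUTPUT_FILE_PATH, mode='w', encoding="utf-8") as out:
    while node:
        # Get the word (surface form)
        word = node.surface
        # The feature string starts with "part of speech,subcategory,...".
        # Keep both fields so the later filter can match 名詞/動詞/形容詞
        # and still exclude 代名詞 (a subcategory of 名詞).
        pos = ",".join(node.feature.split(",")[:2])
        out.write('{0} , {1}\n'.format(word, pos))
        # Advance to the next word
        node = node.next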
Left as it is, the output is full of words that carry no meaning on their own, such as particles, so extract only the parts of speech that are likely to be meaningful as words (nouns, verbs, and adjectives, excluding pronouns).
$ grep -e "名詞" -e "動詞" -e "形容詞" mecab.txt | grep -v "代名詞" | cut -d',' -f 1 > mecab_edited.txt
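The same filter can also be sketched in Python for environments without grep and cut. This is only an illustration: it assumes mecab.txt holds one `word , part-of-speech` pair per line, as written by KeitaisoKaiseki.py, with the Japanese tags that MeCab emits (名詞 = noun, 動詞 = verb, 形容詞 = adjective, 代名詞 = pronoun). The name `filter_words` and its parameters are hypothetical. Like grep, it matches substrings, so "動詞" also matches "助動詞" (auxiliary verb).

```python
KEEP = ("名詞", "動詞", "形容詞")  # noun, verb, adjective
DROP = ("代名詞",)                 # pronoun


def filter_words(lines, keep=KEEP, drop=DROP):
    """Return the words from lines that mention a kept tag and no dropped tag."""
    words = []
    for line in lines:
        # Equivalent of: grep -e ... -e ...
        if not any(tag in line for tag in keep):
            continue
        # Equivalent of: grep -v ...
        if any(tag in line for tag in drop):
            continue
        # Equivalent of: cut -d',' -f 1, plus stripping the padding around the word
        words.append(line.split(",")[0].strip())
    return words
```

Passing an open file object for mecab.txt as `lines` and writing the returned words one per line would reproduce mecab_edited.txt, apart from the stripped whitespace.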
Use a word cloud to visualize how often each word appears in the cleaned-up reviews. Implement it by referring to the following article.
Text mining with Python (2) Visualization with Word Cloud - Qiita
WordCloud.py
# coding: utf-8
from wordcloud import WordCloud

# A Japanese font is needed to render Japanese words (macOS path)
FONT_PATH = "/System/Library/Fonts/ヒラギノ角ゴシック W9.ttc"
INPUT_FILE_PATH = "./mecab_edited.txt"
OUTPUT_FILE_PATH = "./wordcloud.png"

with open(INPUT_FILE_PATH, encoding="utf-8") as f:
    text = f.read()

# Placeholders: replace these with the words you want to remove from the output image
stop_words = ["From the output image", "I want to remove", "Word", "Please set"]

wordcloud = WordCloud(background_color="white",
                      font_path=FONT_PATH,
                      width=800, height=600,
                      stopwords=set(stop_words)).generate(text)
wordcloud.to_file(OUTPUT_FILE_PATH)
Since there is still noise, repeat the cycle "register unnecessary words as stop words (https://qiita.com/Hironsan/items/2466fe0f344115aff177#%E3%82%B9%E3%83%88%E3%83%83%E3%83%97%E3%83%AF%E3%83%BC%E3%83%89%E3%81%AE%E9%99%A4%E5%8E%BB) → generate the word cloud" until you get results that you can analyze.
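One way to pick stop-word candidates in each round is to list the most frequent words that remain. The following is a minimal sketch, assuming the input is whitespace-separated words as in mecab_edited.txt. `top_words` is a hypothetical helper, and its exact-match filtering only approximates how the wordcloud library applies its `stopwords` argument.

```python
from collections import Counter


def top_words(text, stop_words=(), n=20):
    """Return the n most frequent words in text, skipping stop_words."""
    excluded = set(stop_words)
    words = [w for w in text.split() if w not in excluded]
    return Counter(words).most_common(n)
```

Printing `top_words(text, stop_words)` before generating the image shows which noisy words still dominate, so they can be added to `stop_words` for the next round.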
The following image was obtained. It was generated from the reviews of a certain IT company.
Even looking at this image, it is hard to grasp the company's tendencies, so I will focus on the characteristic word "motivation" and analyze it further.
Is motivation really high? Or is it low?
First, extract the reviews that contain the string "motivation".
$ grep "motivation" scraping.csv > scraping_motivation.csv
I have already cut into my sleep too much, so from here on I will use an external service. → AI Text Mining by User Local
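The external service handles the analysis from this point. For readers who would rather stay in Python, a much simpler stand-in (not the service's method) is to count which words appear in the same review as the keyword. The sketch below assumes each review has already been split into a list of words, for example with MeCab; `cooccurring_words` is a hypothetical name.

```python
from collections import Counter


def cooccurring_words(tokenized_reviews, keyword, n=10):
    """Count the words that appear in the same review as keyword."""
    counts = Counter()
    for words in tokenized_reviews:
        if keyword in words:
            # Count each word once per review, excluding the keyword itself
            counts.update(set(words) - {keyword})
    return counts.most_common(n)
```

A simple count like this ignores grammar, so it cannot tell "motivation is high" from "motivation is not high". That is the kind of distinction a dependency-based analysis is better suited to.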
Dependency parsing of "motivation" showed that positive words such as "high" and "increase" rank near the top in frequency of appearance, so I found that **employees of this company seem to be highly motivated.**
Although I am practically an amateur, I tried data analysis by imitating what I had seen others do. In the end I did not reach the information I wanted, but I experienced firsthand that visualizing data can lead to new discoveries. To achieve the goal, the data to collect and the analysis method probably have to be designed together from the start. (That looks difficult...)
"How to enjoy work?" → "Highly motivated" → Why?
However, it seems difficult to derive every answer from data alone. Isn't the best shortcut to use the data as a supporting tool while repeatedly testing hypotheses with users? Yes, just like design thinking.