Sentiment analysis of corporate word-of-mouth data from Job Change Conference using deep learning

Since this is an Advent Calendar article published under the company's name, I will cover material related to the company.

(Introduction) About Job Change Conference

Livesense operates Job Change Conference, a site for corporate reputation reviews written by job changers, where new reviews about companies are posted daily.

Until now, Job Change Conference collected the review text and the five-level score as separate pieces of data. With the recent site renewal, a score and review text can now be posted together, so we can provide reviews that are easier to read.

Reviews posted before the renewal / reviews posted after the renewal:
<img src="https://qiita-image-store.s3.amazonaws.com/0/7307/680fd65f-5b4f-6919-0ab7-b53163d3d0eb.png" alt="before.png" width="400">

The problem I saw here

Newly posted reviews are easier to read thanks to their scores, but the large number of reviews accumulated in the past naturally has no five-level rating data, so a face icon cannot be displayed for them as-is.

However, if this problem were solved and **past reviews became just as easy to read, the site would be even better!**

I decided to try to solve the problem.

Getting started: examining the methods

When I looked into it, I learned that this problem is called sentiment analysis (also referred to as reputation analysis, Sentiment Analysis, or Sentiment Classification).

The methods can be roughly divided as follows.

- Judgment methods using an emotion dictionary
  - Try rudimentary sentiment analysis on Twitter Stream API data
  - Results of negative/positive judgment of tweets on Twitter with Python
- Methods using machine learning
  - Sentiment analysis of tweets by deep learning

Method 1: Using an emotion dictionary

The emotion-dictionary approach uses simple logic: for example, "overtime" counts as negative, but if it is negated, as in "no overtime", the negative is cancelled and the phrase is judged positive. However, this quickly ran into trouble.

There wasn't much overtime

Cases like the one above made it clear that a quick tweak would not solve the problem; I would have to account for a wide variety of phrasing patterns. What's more, phrases like this seem to appear frequently, probably because of the nature of job-change reviews...
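To make the difficulty concrete, here is a minimal sketch of this kind of dictionary logic. The word list and the cancellation pattern are invented for illustration; they are not taken from any real emotion dictionary.

```python
# A minimal, hypothetical sketch of the emotion-dictionary approach.
# The word list and cancellation pattern are made up for illustration.
NEGATIVE_WORDS = ["overtime"]
CANCEL_PATTERNS = ["no overtime"]  # phrasings that cancel the negative hit

def judge(text):
    score = 0
    for word in NEGATIVE_WORDS:
        if word in text:
            score -= 1
    for pattern in CANCEL_PATTERNS:
        if pattern in text:
            score += 2  # cancel the negative hit and count it as positive
    return "positive" if score > 0 else "negative"

print judge("lots of overtime")            # -> negative, as intended
print judge("no overtime")                 # -> positive, as intended
# "There wasn't much overtime" matches no known pattern, so it is
# misjudged as negative; every new phrasing needs its own rule.
print judge("there wasn't much overtime")  # -> negative (wrong)
```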

Having reached that point, I gave up on this method early.

Method 2: Using machine learning

Machine learning and deep learning were unknown territory for me. I was completely ignorant of them and they seemed difficult, but I was also curious, so I decided to explore this approach.

Investigating sentiment analysis with machine learning

Starting from only vague knowledge, I dug further using the keyword "Sentiment Analysis" and found the following.

- A recurrent neural network (RNN) seems to be a good fit for handling sentences.
  - Among RNNs, LSTM tends to give better accuracy.
  - Predict time series data with a neural network
- I found several samples that perform Sentiment Analysis:
  - A sample using TensorFlow (github)
    - Japanese input is split into character units.
    - It looked good, but seemed difficult to control freely.
  - A Chainer sample (github)
    - It appears to target English input.
    - Using it seems to require a fair amount of preparation.
    - Reference article
  - A sample running a character-based RNN with Chainer (github)
    - This one generates the next sentence rather than doing Sentiment Analysis. It looks fun too.
  - A sample using Theano
    - It judges English movie reviews.
    - I could follow where and what it is doing.
    - **If I can prepare the data, I can get something running.** I saw hope here.

(Main subject starts here) Performing sentiment analysis with LSTM

This time, I will use the sample using Theano to proceed with the verification.

(My code-reading notes are in a separate article, so please refer to that if you are interested.)

Characteristics of the training data and problem setting

Characteristics of training data

The Job Change Conference review data used as training data this time has the following characteristics per review (a hypothetical example record follows the list).

- Text data
  - 100 characters or more
  - A review about one company (not very relevant this time)
  - An answer to one question (not very relevant this time)
- Score data
  - A five-level rating of the company
  - 1 is the lowest, 5 is the highest
  - The familiar style where you choose how many stars to give
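To make the shape of the data concrete, here is a hypothetical example of a single review record. The field names are mine for illustration, not the site's actual schema.

```python
# A hypothetical example of one review record; the field names are
# illustrative, not the actual schema of the review data.
review = {
    "text": "The work was rewarding, but ...",  # free text, 100+ characters
    "company_id": 42,    # the one company the review is about
    "question_id": 7,    # the one question the review answers
    "score": 4,          # five-level rating: 1 (lowest) to 5 (highest)
}
```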

Problem setting

Based on the findings so far, I restated the problem as follows.

Create a model that can judge whether a posted review text is positive or negative.

Ideally I would aim to predict the full five-level rating to match the data, but that still seems too difficult, so I will not attempt it this time.

Also, since this is a rough first attempt, I will proceed on the assumption that reviews with low scores are "negative" and reviews with high scores are "positive".

Data preparation

Data preparation proceeds as follows (a minimal sketch follows the list).

- For input, use a dictionary that encodes the text character by character.
  - Example: "Good company" -> [1, 1, 2, 3]
- Prepare 1,300 reviews each with a score of 1 (negative) and a score of 5 (positive).
  - Relabel them as 0 for negative and 1 for positive.
- Modify the data-reading script so that it can read our own data.
  - This script splits the learning data into training and validation sets.
- Prepare test data in the same way as the training data.
  - In this script, the test data is kept completely separate from learning and used only to measure the error.
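As a rough illustration, the preparation could look like the sketch below. It assumes the reviews are already loaded as (text, score) pairs; the helper names are mine, not those of the actual scripts.

```python
# A minimal sketch of the data preparation above, assuming reviews are
# already loaded as (text, score) pairs. Helper names are illustrative.

def build_char_dict(texts):
    # Assign an integer ID to each character seen; 0 is reserved for unknowns
    char_dict = {}
    for text in texts:
        for ch in text:
            if ch not in char_dict:
                char_dict[ch] = len(char_dict) + 1
    return char_dict

def encode(text, char_dict):
    return [char_dict.get(ch, 0) for ch in text]

# In reality: 1,300 reviews each with score 1 (negative) and 5 (positive)
reviews = [("Good company", 5), ("Nothing but overtime", 1)]

char_dict = build_char_dict(text for text, _ in reviews)

# Relabel: score 1 -> 0 (negative), score 5 -> 1 (positive)
dataset = [(encode(text, char_dict), 0 if score == 1 else 1)
           for text, score in reviews]

print dataset[0]  # ([1, 2, 2, 3, ...], 1)
```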

For now, try running it as a sample

Run it once the data is ready. With the default settings it takes a long time, so it seemed better to adjust the hidden layer size appropriately, as in the sketch below.
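For reference, a training run with a smaller hidden layer could be kicked off as follows. train_lstm() and the dim_proj / validFreq arguments are from the Theano LSTM tutorial's training script; the specific values are the ones that come up later in this article.

```python
# Kick off training with a smaller hidden layer. train_lstm() and its
# arguments are from the Theano LSTM tutorial's training script (lstm.py).
import lstm

lstm.train_lstm(
    dim_proj=8,               # hidden layer / embedding size (default: 128)
    validFreq=30,             # compute the validation error every 30 updates
    max_epochs=100,           # cap the run instead of the large default
    saveto='lstm_model.npz',  # where the trained parameters are saved
)
```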

Try with a simple script

After waiting a while for the model to be built, I threw some simple phrases at it to see it work.

Looking at pred_probs() around line 400 of the training script, there is a comment to the effect of "refer to this if you want to run predictions with a trained model", so I verified with a small script based on it.

negaposi.py


import numpy

import imdb
import lstm

# Load the trained parameters and wrap them as Theano shared variables
model = numpy.load("lstm_model.npz")
tparams = lstm.init_tparams(model)

(use_noise, x, mask, y, f_pred_prob, f_pred, cost) = lstm.build_model(tparams, {
    # Without these options, build_model raises a KeyError
    'encoder': 'lstm',
    'dim_proj': 128,
    'use_dropout': True,
})

# Minimal simplification of pred_probs() from the training script
def pred_probs(f_pred_prob, sentence):
    x, mask, _y = imdb.prepare_data([sentence],
                                    1,  # dummy label; not used for prediction
                                    maxlen=None)
    return f_pred_prob(x, mask)[0]

# Inputs encoded character by character with the dictionary
sentences = [
    {
        "data": [27, 72, 104, 150, 19, 8, 106, 23],
        "text": "It's a very good company"
    },
    {
        "data": [27, 72, 104, 402, 121, 73, 8, 106, 23],
        "text": "It's a very bad company"
    }
]

for sentence in sentences:
    result = pred_probs(f_pred_prob, sentence["data"])
    print "==="
    print result
    print sentence["text"], ("is positive" if (result[0] < result[1]) else "is negative")

Run

% python negaposi.py
===
input: It's a very good company => [27, 72, 104, 150, 19, 8, 106, 23]
output: [ 0.06803907  0.93196093]
It's a very good company is positive
===
input: It's a very bad company => [27, 72, 104, 402, 121, 73, 8, 106, 23]
output: [ 0.73581125  0.26418875]
It's a very bad company is negative

Somehow it produced sensible results. Amazing. The output is the result of a pass through the LSTM, and appears to be the probability of each class.

This felt promising, so I proceeded with further verification.

Test results

Let's take a closer look at the execution results while experimenting with various parameters. For now, here are the results for dim_proj=8 (hidden layer size) and validFreq=30 (validation frequency), which seemed to work reasonably well this time.

Transition of error rate

The script records how the error against the validation data and the test data changes as training progresses. [Chart: transition of the error rate over updates]

One epoch over the training data finishes at around update 2,600, and you can see the error continuing to shrink after that point.

Also, the training script performs early stopping, so you can see that it stops at a reasonable point before overfitting sets in.
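The patience mechanism behind this is roughly as follows; this is a simplified sketch, with stand-in functions for the actual training and error measurement.

```python
import numpy

def train_one_epoch():
    pass  # stand-in for one pass over the training data

def validation_error():
    return numpy.random.rand()  # stand-in for the real error measurement

patience = 10        # how many checks without improvement to tolerate
best_err = numpy.inf
bad_counter = 0

for epoch in range(100):
    train_one_epoch()
    err = validation_error()
    if err < best_err:
        best_err = err     # improvement: remember it, reset the counter
        bad_counter = 0
    else:
        bad_counter += 1   # no improvement this check
        if bad_counter > patience:
            print "early stop at epoch", epoch
            break
```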

Analyzing the results on the test data

In the end, the error rate on the test data settles at roughly 0.2 to 0.3, so we can expect an accuracy of about 70 to 80%.

Let's look at the test data to see whether that really holds.

Distribution of scores versus sentiment classification

Extending the negaposi.py script from earlier to classify the test samples as negative or positive and tally the distribution for each score gives the following (a sketch of the tallying code follows the table).

| Score | negative (%) | positive (%) |
|-------|-------------:|-------------:|
| ★1 reviews | 84.34 | 15.66 |
| ★2 reviews | 66.53 | 33.47 |
| ★3 reviews | 45.07 | 54.93 |
| ★4 reviews | 25.77 | 74.23 |
| ★5 reviews | 27.59 | 72.41 |
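The tallying can be done along these lines; this fragment is meant to be appended to negaposi.py above, so pred_probs() and f_pred_prob come from there, and test_samples (a list of (encoded_text, score) pairs) is assumed to be loaded separately.

```python
# Fragment to append to negaposi.py: tally negative/positive judgments
# per star rating. pred_probs() and f_pred_prob come from that script;
# test_samples = [(encoded_text, score), ...] is assumed to be loaded.
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])  # score -> [negative, positive]

for data, score in test_samples:
    probs = pred_probs(f_pred_prob, data)
    counts[score][1 if probs[1] > probs[0] else 0] += 1

for score in sorted(counts):
    neg, pos = counts[score]
    total = float(neg + pos)
    print "star %d: negative %.2f%% / positive %.2f%%" % (
        score, 100 * neg / total, 100 * pos / total)
```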

A few things stand out:

- Impression: looks good.
  - ★1 and ★5 are classified correctly 70 to 80% of the time, which matches the measured error rate well.
  - ★2 and ★4, which were not part of the training data, also come out at reasonable values.
  - It is a little interesting that ★3 lands close to 50-50.
  - Taking a closer look at the contents might reveal something.

Taking stock of the verification results

- Result: it feels like a success.

- If I were to push the accuracy a little further, these seem promising:
  - Wouldn't narrowing the data down by attributes such as age and industry improve accuracy?
  - Tuning more parameters, such as the hidden layer size?
  - Is the amount of data simply still too small? Should I wait a little longer for more?

Impressions

- While I was researching, it was constantly "I don't understand any of this...", but once I started running things, a lot of it clicked.
- It is important to just try things first.
- Reading Deep Learning right when my pile of things I didn't understand had peaked felt like it filled in the gaps in my head, and it was a great help.
- I struggled for a few days, but in many ways I got to feel how formidable the world of machine learning is.
- It would not be surprising if this becomes a required subject for engineers within a few years.
- pandas was easy to work with and awesome. Thank you, pandas.
