[Python] Sentiment analysis with natural language processing! I tried to predict the rating from the review text

Natural language processing comes up almost everywhere in machine learning.

The term sounds difficult, and it can be hard to pin down what it actually means.

In reality, though, it is indispensable to our daily lives.

Moreover, even beginners can build something that works, provided they prepare the necessary dataset and know how to design the program.

So this time, we will actually create a natural language processing program in Python.

The challenge here is sentiment analysis. Based on a dataset of women's clothing reviews, we will predict which rating a newly entered text corresponds to.

Before the actual coding, though, let's first go over what natural language processing and sentiment analysis are.

What are natural language processing and sentiment analysis in the first place?

To explain what natural language processing is, we first need to clarify the difference between two kinds of language.

The two are natural languages and programming languages. The difference lies in how many meanings and interpretations a single sentence can have.

An example in natural language is "a girl with big black eyes". From this alone, at least two interpretations are possible.

One groups the words as "black", "big eyes", and "girl", that is, a dark-complexioned girl with big eyes. The other groups them as "black eyes" and "big girl", that is, a tall girl with black eyes.

As you can see, natural language contains ambiguity.

An example in a programming language is "4 * 6 + 1".

This can only be interpreted as multiplying 4 by 6 and then adding 1; there is no room for any other reading.

Programming languages are unambiguous in this way: a computer always interprets and executes the same statement in the same way.
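As a trivial demonstration, any Python interpreter evaluates the expression the same way every time:

# Multiplication binds tighter than addition, so there is exactly one reading
print(4 * 6 + 1)  # 25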

Natural language processing is a set of techniques for handling huge amounts of text data in practice, despite the ambiguity of natural language.

Sentiment analysis, one such application, quantifies the emotional content of text. It can be used for many kinds of feedback analysis, such as analyzing product reviews on the Internet.

What we will do this time

This time, we will perform natural language processing on "Womens Clothing E-Commerce Reviews.csv", a Kaggle dataset that collects reviews of women's clothing.

For sentiment analysis, we use the numerical rating that each review gives the clothing.

The ratings run in 5 steps from 1 to 5. We analyze which words appear at each rating, and finally make it possible to predict which rating a newly entered sentence corresponds to.

Load the libraries and data

Now that we have touched on the dataset and the target of the sentiment analysis, let's walk through the build process step by step.

First, load the necessary libraries and data.

import numpy as np
import pandas as pd

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

review = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
# Adjust the path to the CSV file as needed
review = pd.DataFrame(review)  # read_csv already returns a DataFrame, so this line is just a safeguard

Let's look at the role of each library.

numpy

NumPy is a Python library for efficient numerical computation. In machine learning, a model is trained by repeated operations on multidimensional arrays such as vectors and matrices, and NumPy makes those computations efficient, which makes it indispensable.
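As a small illustration of the vectorized arithmetic NumPy provides (the array values here are made up):

import numpy as np

ratings = np.array([5, 4, 5, 3, 1])  # a made-up array of ratings
print(ratings.mean())  # 3.6 -- one call summarizes the whole array
print(ratings * 2)     # [10  8 10  6  2] -- operations apply elementwise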

pandas

pandas handles the work required for data analysis efficiently. In a data analysis project, preprocessing before machine learning is often said to account for 80% to 90% of the total effort: reading data, filling in missing values, and otherwise arranging the data so that machine learning can run properly. pandas provides all of these features, so you can work efficiently.
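A minimal sketch of that kind of preprocessing, with made-up data:

import pandas as pd

df = pd.DataFrame({"Text": ["Great dress", None], "Rating": [5, 2]})
print(df.isnull().sum())  # count missing values per column
print(df.dropna())        # drop rows that contain missing values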

nltk

nltk is a platform for building Python programs that work with human language data. It bundles tools for many tasks, such as text analysis and classification.
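For instance, a toy run of the Naive Bayes classifier used later in this article (features and labels made up):

from nltk.classify import NaiveBayesClassifier

train = [({"great": True}, "pos"), ({"awful": True}, "neg")]
clf = NaiveBayesClassifier.train(train)
print(clf.classify({"awful": True}))  # -> 'neg'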

Data overview and checking for missing values

Let's get an overview of the data and check for missing values.

What is included in the data

review.columns
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')
review = review.rename(columns={"Unnamed: 0": "ID", 'Review Text': 'Text'})
# Rename "Unnamed: 0" to "ID" and 'Review Text' to 'Text'
review = review[["ID", "Text", "Rating"]]
# Only the ID, review text, and rating columns are used here

Missing data

# Check the number of missing values with isnull().sum()
review.isnull().sum()
ID          0
Text      845
Rating      0
dtype: int64
# There are 845 rows that have a rating but no review text written

# Use dropna() to drop those rows
review = review.dropna()
review.isnull().sum()
ID        0
Text      0
Rating    0
dtype: int64
# The rows with missing text are now gone

The beginning of the data

review.head()  # Display the first few rows of the data with head()
ID Text Rating
0 Absolutely wonderful - silky and sexy and comf... 4
1 Love this dress! it's sooo pretty. i happene... 5
2 I had such high hopes for this dress and reall... 3
3 I love, love, love this jumpsuit. it's fun, fl... 5
4 This shirt is very flattering to all due to th... 5

The big picture of the data

review.describe()  # View summary statistics with describe()
# There are 22,641 reviews, and the median rating is 5, so nearly half of the reviews give a 5.

|       |           ID |       Rating |
|:------|-------------:|-------------:|
| count | 22641.000000 | 22641.000000 |
| mean  | 11740.849035 |     4.183561 |
| std   |  6781.957509 |     1.115762 |
| min   |     0.000000 |     1.000000 |
| 25%   |  5872.000000 |     4.000000 |
| 50%   | 11733.000000 |     5.000000 |
| 75%   | 17621.000000 |     5.000000 |
| max   | 23485.000000 |     5.000000 |
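The skew toward high ratings can also be checked directly; a quick sketch using the same dataframe:

# Count how many reviews fall at each rating from 1 to 5
print(review["Rating"].value_counts().sort_index())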

Check data type

review.dtypes  # Check the column types with dtypes. If a type gets in the way of processing or computation, it may need to be converted.
ID         int64
Text      object
Rating     int64
dtype: object
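If a column ever needs a different type, pandas can convert it; for example (hypothetical here, since Rating is already int64):

# Hypothetical: ensure Rating is stored as integers
review["Rating"] = review["Rating"].astype(int)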

Let's train the model

Now that we have checked the data and removed the missing values, let's start training right away.

First, split every review into words and store it, together with its rating, in a separate list for each rating.

rate_id_one = []
rate_id_two = []
rate_id_three = []
rate_id_four = []
rate_id_five = []
# One list per rating; each element will be a (word-feature dict, rating) pair
rate_lists = {1: rate_id_one, 2: rate_id_two, 3: rate_id_three,
              4: rate_id_four, 5: rate_id_five}

for text, rating in zip(review['Text'], review['Rating']):
    words = text.split(" ")
    dic = {word: True for word in words}   # mark each word as present
    rate_lists[rating].append((dic, rating))
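Each stored element is a (feature dict, label) pair, which is the input format NLTK's NaiveBayesClassifier.train expects. You can inspect one to confirm (the exact content depends on the first rating-5 review in the data):

print(rate_id_five[0])
# e.g. ({'Love': True, 'this': True, 'dress!': True}, 5)  <- illustrative, not actual output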

Now that the reviews are sorted by rating, we split each list 8:2 into training data and test data.

The totals printed below are the sums over the five rating lists for the training and test sets.

threshold = 0.8
num_one = int(threshold * len(rate_id_one))
num_two = int(threshold * len(rate_id_two))
num_three = int(threshold * len(rate_id_three))
num_four = int(threshold * len(rate_id_four))
num_five = int(threshold * len(rate_id_five))

features_train = rate_id_one[:num_one] + rate_id_two[:num_two] + rate_id_three[:num_three] + rate_id_four[:num_four] + rate_id_five[:num_five]
features_test = rate_id_one[num_one:] + rate_id_two[num_two:] + rate_id_three[num_three:] + rate_id_four[num_four:] + rate_id_five[num_five:]
print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))

Number of training datapoints: 18111
Number of test datapoints: 4530

With the data split into training and test sets, we can start training.

We then had the trained classifier predict the ratings of the test data, but the accuracy turns out to be less than half.

classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))

Accuracy of the classifier: 0.4640176600441501

One likely cause is that there are five classes to choose from, and that the absolute number of low-rated reviews is small. For example, narrowing the task to choosing between a rating of 1 and a rating of 5 might improve accuracy.

Other methods might improve accuracy further as well; that is left as future work.
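As a sketch of that 1-versus-5 idea, reusing the lists and split points built above (the resulting accuracy is something to verify, not a reported result):

# Train and evaluate on only the rating-1 and rating-5 reviews
binary_train = rate_id_one[:num_one] + rate_id_five[:num_five]
binary_test = rate_id_one[num_one:] + rate_id_five[num_five:]
binary_clf = NaiveBayesClassifier.train(binary_train)
print('1-vs-5 accuracy:', nltk_accuracy(binary_clf, binary_test))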

Let's see which words most influenced the predicted ratings during training.

N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0]) 

Top 15 most informative words:
1. worst
2. shame
3. poorly
4. horrible
5. disappointment.
6. cheap.
7. strange.
8. sad.
9. dull
10. terrible.
11. returned.
12. terrible
13. awkward.
14. concept
15. awful.

Negative words such as worst, shame, and disappointment stand out.

Straightforward expressions of negativity appear to be the decisive factor that pushes a review toward a low rating.
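NLTK can also print these words together with the label ratios behind them, using the classifier's built-in method:

# Show the most informative features along with their label ratios
classifier.show_most_informative_features(15)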

Some of these words carry a trailing period. That is because we split on spaces only, so "terrible" and "terrible." are counted as separate words; this time we simply treat each as a word in its own right.
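If you wanted to merge such variants instead, one possible normalization step (not used in this article) is to lowercase each word and strip surrounding punctuation before building the feature dict:

import string

def normalize(text):
    # Lowercase and strip punctuation so "terrible." and "terrible" match
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return [w for w in words if w]  # drop tokens that were punctuation only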

Let's actually write a review

Now let's write our own review and predict its rating.

def extract_features(words):
    return dict([(word, True) for word in words])
# Turn the input into a word-feature dict, just as we did for the review texts above

input_review = input()
print("Clothes review predictions:")

print("\nReview:",input_review)
features = extract_features(input_review.split())
probabilities = classifier.prob_classify(features)
predicted_sentiment = probabilities.max()
print("Predicted sentiment:", predicted_sentiment)
print("Probability:", round(probabilities.prob(predicted_sentiment), 2))
# Compute and output the rating the input text most likely corresponds to

As an example, enter "I cannnot believe how terrible is it!" here (roughly, "I can't believe how terrible it is!").

I cannnot believe how terrible is it!
Clothes review predictions:

Review: I cannnot believe how terrible is it!
Predicted sentiment: 1
Probability: 0.61

The model judges the sentence most likely to deserve the lowest rating, 1.

In closing

This time, after covering what natural language processing and sentiment analysis are, we actually implemented natural language processing using a dataset from Kaggle.

Even a programming beginner can implement this, given the necessary data and the right steps. The steps we actually took were as follows.

  1. Load the library and data

  2. Check the data, process it, and prepare it for training.

  3. Learn and check performance

Once you understand the overall flow and can write and run the code yourself, you can apply it to other datasets, so it's a good idea to master this pattern and then try it on data of your own.

Finally, here is the complete code for reference.

Full code

import numpy as np
import pandas as pd

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

review = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
# Adjust the path to the CSV file as needed
review = pd.DataFrame(review)  # read_csv already returns a DataFrame, so this line is just a safeguard


review.columns
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')

review = review.rename(columns={"Unnamed: 0": "ID", 'Review Text': 'Text'})
review = review[["ID", "Text", "Rating"]]  # Only the ID, review text, and rating columns are used here

review.isnull().sum()
ID          0
Text      845
Rating      0
dtype: int64

review = review.dropna()
review.isnull().sum()
ID        0
Text      0
Rating    0
dtype: int64

review.head()      # Display the first few rows of the data
review.describe()  # View summary statistics

review.dtypes
ID         int64
Text      object
Rating     int64
dtype: object

rate_id_one = []
rate_id_two = []
rate_id_three = []
rate_id_four = []
rate_id_five = []
# One list per rating; each element will be a (word-feature dict, rating) pair
rate_lists = {1: rate_id_one, 2: rate_id_two, 3: rate_id_three,
              4: rate_id_four, 5: rate_id_five}

for text, rating in zip(review['Text'], review['Rating']):
    words = text.split(" ")
    dic = {word: True for word in words}   # mark each word as present
    rate_lists[rating].append((dic, rating))

rate_id_one[0]  # Show one (feature dict, rating) pair from the list

len(rate_id_one)
821

threshold = 0.8
num_one = int(threshold * len(rate_id_one))
num_two = int(threshold * len(rate_id_two))
num_three = int(threshold * len(rate_id_three))
num_four = int(threshold * len(rate_id_four))
num_five = int(threshold * len(rate_id_five))

features_train = rate_id_one[:num_one] + rate_id_two[:num_two] + rate_id_three[:num_three] + rate_id_four[:num_four] + rate_id_five[:num_five]
features_test = rate_id_one[num_one:] + rate_id_two[num_two:] + rate_id_three[num_three:] + rate_id_four[num_four:] + rate_id_five[num_five:]
print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))
Number of training datapoints: 18111
Number of test datapoints: 4530

classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))

Accuracy of the classifier: 0.4640176600441501

N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0]) 

Top 15 most informative words:
1. worst
2. shame
3. poorly
4. horrible
5. disappointment.
6. cheap.
7. strange.
8. sad.
9. dull
10. terrible.
11. returned.
12. terrible
13. awkward.
14. concept
15. awful.

def extract_features(words):
    return dict([(word, True) for word in words])

# Classify a review typed in on the spot
input_review = input()
print("Clothes review predictions:")

print("\nReview:",input_review)
features = extract_features(input_review.split())
probabilities = classifier.prob_classify(features)
predicted_sentiment = probabilities.max()
print("Predicted sentiment:", predicted_sentiment)
print("Probability:", round(probabilities.prob(predicted_sentiment), 2))

I cannnot believe how terrible is it!
Clothes review predictions:

Review: I cannnot believe how terrible is it!
Predicted sentiment: 1
Probability: 0.61
