Natural language processing is a field you will almost always encounter in machine learning. The term sounds difficult, so it can be hard to pin down, yet it is actually indispensable to our daily lives. Moreover, even beginners can put it into practice if they prepare the necessary dataset and know how to design the program.
So this time we will build a program in Python that uses natural language processing. The task is sentiment analysis: based on a dataset of reviews of women's clothing, we will predict which rating a newly entered text corresponds to.
Before we start the actual coding, let's cover what natural language processing is and what sentiment analysis is.
To explain what natural language processing is, we first need to clarify the difference between two kinds of languages: natural languages and programming languages. The difference lies in how many possible meanings and interpretations a sentence allows.
An example of natural language is "a girl with big black eyes". This phrase alone allows at least two interpretations. One groups the words as "big", "black eyes", and "girl", that is, a girl whose black eyes are big. The other groups them as "black eyes" and "big girl", that is, a tall girl with black eyes.
As you can see, natural language contains ambiguity.
An example of a programming language expression is "4 * 6 + 1". This can only be interpreted as multiplying 4 by 6 and then adding 1, so there is no room for any other reading.
Programming languages are unambiguous in this way because a computer must always interpret and execute the same statement in the same manner.
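As a quick check, here is a minimal sketch (runnable in any Python interpreter) showing that the expression has exactly one evaluation order, fixed by operator precedence:

```python
# Operator precedence fixes the interpretation: multiplication binds before addition.
print(4 * 6 + 1)    # 25 -- always (4 * 6) + 1
print((4 * 6) + 1)  # 25 -- parentheses just make the same order explicit
print(4 * (6 + 1))  # 28 -- a different expression, not a different "reading"
```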
Natural language processing is a technology for practically handling huge amounts of text data in spite of the ambiguity of natural language. Within it, sentiment analysis quantifies the emotional elements contained in text. It can be used for many kinds of feedback, such as analyzing product reviews on the Internet.
This time, we will perform natural language processing based on "Womens Clothing E-Commerce Reviews.csv", a Kaggle dataset that collects reviews of women's clothing.
What we use for sentiment analysis is the numerical rating given to the clothes in each review. The ratings run in five steps from 1 to 5. We will analyze which words are used at each rating and finally make it possible to predict which rating an entered sentence corresponds to.
Now that we have covered the dataset and the target of the sentiment analysis, let's walk through the creation process step by step.
First, load the necessary libraries and data.
import numpy as np
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
review = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
# Adjust the path to the CSV file as needed.
# read_csv already returns a DataFrame, so no further conversion is necessary.
Let's look at the role of each library.
numpy
numpy is a Python library for efficient numerical computation. In machine learning, a model is trained by repeatedly operating on multidimensional arrays such as vectors and matrices, and numpy performs those computations efficiently. That makes this library indispensable.
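As a minimal illustration (the matrix and vector values here are made up), numpy performs vectorized arithmetic on whole arrays without explicit Python loops:

```python
import numpy as np

# A hypothetical 2x2 weight matrix and an input vector, purely for illustration.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.5, -1.0])

print(W @ x)      # matrix-vector product: [-1.5 -2.5]
print(x * 2 + 1)  # elementwise arithmetic over the whole array at once: [ 2. -1.]
```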
pandas
pandas efficiently performs the work required for data analysis. In a data analysis project, the preprocessing that precedes machine learning is often said to account for 80% to 90% of the total effort: reading the data, filling in or removing missing values, and otherwise arranging the data so that machine learning can run properly. pandas provides the features needed for this, so you can work efficiently.
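For example, here is a minimal sketch of typical preprocessing steps, using a tiny made-up DataFrame rather than the real dataset:

```python
import pandas as pd

# A made-up DataFrame with one missing review text.
df = pd.DataFrame({"Text": ["great dress", None, "runs small"],
                   "Rating": [5, 4, 2]})

print(df.isnull().sum())  # count missing values per column
df = df.dropna()          # drop rows that contain missing values
print(df.head())          # inspect the first rows
```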
nltk
nltk is a platform for building Python programs that work with human language data. It provides tools for many kinds of processing, such as analyzing and classifying sentences.
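As a small sketch of the classifier interface used later in this article (the two training examples are made up), nltk's NaiveBayesClassifier trains on (feature dictionary, label) pairs:

```python
from nltk.classify import NaiveBayesClassifier

# Made-up training data: each example is a ({word: True, ...}, label) pair.
train = [({"love": True, "pretty": True}, "pos"),
         ({"terrible": True, "cheap": True}, "neg")]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({"love": True}))  # expected: 'pos'
```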
With each library's role covered, let's look at an overview of the data and check for missing values.
review.columns
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
'Recommended IND', 'Positive Feedback Count', 'Division Name',
'Department Name', 'Class Name'],
dtype='object')
review = review.rename(columns={"Unnamed: 0": "ID", 'Review Text': 'Text'})
# Rename "Unnamed: 0" to "ID" and 'Review Text' to 'Text'.
review = review[["ID", "Text", "Rating"]]
# Only the ID, text, and rating columns are used here.
# Check the number of missing values with isnull().sum().
review.isnull().sum()
ID 0
Text 845
Rating 0
dtype: int64
# There are 845 rows where a rating was given but no review text was written.
# Use dropna() to delete such rows.
review = review.dropna()
review.isnull().sum()
ID 0
Text 0
Rating 0
dtype: int64
# The rows with missing text have now been removed.
review.head()  # Display the first rows of the data with head().
| ID | Text | Rating |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | 4 |
| 1 | Love this dress! it's sooo pretty. i happene... | 5 |
| 2 | I had such high hopes for this dress and reall... | 3 |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 5 |
| 4 | This shirt is very flattering to all due to th... | 5 |
review.describe()  # Summarize the numerical columns with describe().
# There are 22,641 data points, and nearly half of the reviews have a rating of 5.

| | ID | Rating |
|---|---:|---:|
| count | 22641.000000 | 22641.000000 |
| mean | 11740.849035 | 4.183561 |
| std | 6781.957509 | 1.115762 |
| min | 0.000000 | 1.000000 |
| 25% | 5872.000000 | 4.000000 |
| 50% | 11733.000000 | 5.000000 |
| 75% | 17621.000000 | 5.000000 |
| max | 23485.000000 | 5.000000 |
review.dtypes  # Check the column types with dtypes; if a type is inconvenient for processing, it may need to be converted.
ID int64
Text object
Rating int64
dtype: object
Now that we have confirmed the data and removed the missing values, let's move on to training.
First, split the text of every review into words and store the resulting word dictionaries in a separate list for each rating.
rate_id_one = []
rate_id_two = []
rate_id_three = []
rate_id_four = []
rate_id_five = []

for text, rating in zip(review['Text'], review['Rating']):
    # Mark every word in the review as a present feature.
    dic = {word: True for word in text.split(" ")}
    if rating == 1:
        rate_id_one.append((dic, 1))
    elif rating == 2:
        rate_id_two.append((dic, 2))
    elif rating == 3:
        rate_id_three.append((dic, 3))
    elif rating == 4:
        rate_id_four.append((dic, 4))
    else:
        rate_id_five.append((dic, 5))
Now that the reviews are sorted by rating, we split each list 8:2 into training data and test data.
The totals are the sums of the five per-rating splits.
threshold = 0.8
num_one = int(threshold * len(rate_id_one))
num_two = int(threshold * len(rate_id_two))
num_three = int(threshold * len(rate_id_three))
num_four = int(threshold * len(rate_id_four))
num_five = int(threshold * len(rate_id_five))
features_train = rate_id_one[:num_one] + rate_id_two[:num_two] + rate_id_three[:num_three] + rate_id_four[:num_four] + rate_id_five[:num_five]
features_test = rate_id_one[num_one:] + rate_id_two[num_two:] + rate_id_three[num_three:] + rate_id_four[num_four:] + rate_id_five[num_five:]
print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))
Number of training datapoints: 18111
Number of test datapoints: 4530
With the data divided into a training set and a test set, let's train the classifier.
When the trained model judges the ratings of the test data, the accuracy turns out to be less than half.
classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.4640176600441501
One likely cause is that there are five possible ratings from 1 to 5, and that the absolute number of low-rated reviews is small. For example, restricting the choice to ratings 1 and 5 may improve accuracy; a sketch of that idea follows below.
Alternatively, other methods might improve accuracy further, which remains a task for the future.
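As an untested sketch of the first idea, the same classifier can be trained on ratings 1 and 5 only, reusing the rate_id_one, rate_id_five, num_one, and num_five variables defined above (the resulting accuracy is not verified here):

```python
# A minimal sketch, assuming rate_id_one, rate_id_five, num_one, and num_five
# from the code above are still in scope.
binary_train = rate_id_one[:num_one] + rate_id_five[:num_five]
binary_test = rate_id_one[num_one:] + rate_id_five[num_five:]

binary_classifier = NaiveBayesClassifier.train(binary_train)
print('Binary accuracy:', nltk_accuracy(binary_classifier, binary_test))
```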
Let's see which words most influenced the predicted ratings during training.
N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0])
Top 15 most informative words:
1. worst
2. shame
3. poorly
4. horrible
5. disappointment.
6. cheap.
7. strange.
8. sad.
9. dull
10. terrible.
11. returned.
12. terrible
13. awkward.
14. concept
15. awful.
Negative words such as "worst", "shame", and "disappointment" stand out.
Straightforward expressions of negativity appear to be the decisive factors that influence the rating.
Some words carry a trailing period because the text was split only on spaces; this time we treat each such token as its own word. A sketch of merging those tokens follows below.
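If you wanted to merge tokens such as "terrible" and "terrible.", one option (not used for the results in this article) is to strip surrounding punctuation and lowercase each token before building the feature dictionary:

```python
import string

def normalize(word):
    # Strip surrounding punctuation and lowercase, so 'Terrible.' -> 'terrible'.
    return word.strip(string.punctuation).lower()

tokens = "It was Terrible.".split(" ")
dic = {normalize(w): True for w in tokens if normalize(w)}
print(dic)  # {'it': True, 'was': True, 'terrible': True}
```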
Now let's write our own sentence and have the model predict its rating.
def extract_features(words):
    # Build the same {word: True} feature dictionary used for the review texts.
    return dict([(word, True) for word in words])

input_review = input()
print("Clothes review predictions:")
print("\nReview:", input_review)

features = extract_features(input_review.split())
probabilities = classifier.prob_classify(features)
predicted_sentiment = probabilities.max()
print("Predicted sentiment:", predicted_sentiment)
print("Probability:", round(probabilities.prob(predicted_sentiment), 2))
# Output the rating the input text most likely corresponds to, with its probability.
For example, enter "I cannnot believe how terrible is it!" here (the typo and broken grammar are left as a user might actually type them).
I cannnot believe how terrible is it!
Clothes review predictions:
Review: I cannnot believe how terrible is it!
Predicted sentiment: 1
Probability: 0.61
The sentence is judged most likely to correspond to the lowest rating.
This time, after covering what natural language processing and sentiment analysis are, we implemented natural language processing using a Kaggle dataset.
Even a programming beginner can implement this, provided they obtain the necessary data and follow the appropriate steps. The steps we actually took were:

1. Load the libraries and data.
2. Check the data, process it, and prepare it for training.
3. Train the model and check its performance.

If you understand this general setup and can actually write and run the code, you can apply it to other datasets, so it is worth mastering and using on your own data.
Finally, here is the complete code for reference.
import numpy as np
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
review = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
# Adjust the path to the CSV file as needed.
# read_csv already returns a DataFrame, so no further conversion is necessary.
review.columns
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
'Recommended IND', 'Positive Feedback Count', 'Division Name',
'Department Name', 'Class Name'],
dtype='object')
review = review.rename(columns={"Unnamed: 0": "ID", 'Review Text': 'Text'})
review = review[["ID", "Text", "Rating"]]  # Only the ID, text, and rating columns are used here.
review.isnull().sum()
ID 0
Text 845
Rating 0
dtype: int64
review = review.dropna()
review.isnull().sum()
ID 0
Text 0
Rating 0
dtype: int64
review.head()  # Display the first rows of the data
review.describe()  # Summarize the numerical columns
review.dtypes
ID int64
Text object
Rating int64
dtype: object
rate_id_one = []
rate_id_two = []
rate_id_three = []
rate_id_four = []
rate_id_five = []

for text, rating in zip(review['Text'], review['Rating']):
    # Mark every word in the review as a present feature.
    dic = {word: True for word in text.split(" ")}
    if rating == 1:
        rate_id_one.append((dic, 1))
    elif rating == 2:
        rate_id_two.append((dic, 2))
    elif rating == 3:
        rate_id_three.append((dic, 3))
    elif rating == 4:
        rate_id_four.append((dic, 4))
    else:
        rate_id_five.append((dic, 5))
rate_id_one[0]  # Inspect the first (word dictionary, rating) pair in the list
len(rate_id_one)
821
threshold = 0.8
num_one = int(threshold * len(rate_id_one))
num_two = int(threshold * len(rate_id_two))
num_three = int(threshold * len(rate_id_three))
num_four = int(threshold * len(rate_id_four))
num_five = int(threshold * len(rate_id_five))
features_train = rate_id_one[:num_one] + rate_id_two[:num_two] + rate_id_three[:num_three] + rate_id_four[:num_four] + rate_id_five[:num_five]
features_test = rate_id_one[num_one:] + rate_id_two[num_two:] + rate_id_three[num_three:] + rate_id_four[num_four:] + rate_id_five[num_five:]
print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))
Number of training datapoints: 18111
Number of test datapoints: 4530
classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.4640176600441501
N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
print(str(i+1) + '. ' + item[0])
Top 15 most informative words:
1. worst
2. shame
3. poorly
4. horrible
5. disappointment.
6. cheap.
7. strange.
8. sad.
9. dull
10. terrible.
11. returned.
12. terrible
13. awkward.
14. concept
15. awful.
def extract_features(words):
    # Build the same {word: True} feature dictionary used for the review texts.
    return dict([(word, True) for word in words])

# Classify a sentence typed on the spot.
input_review = input()
print("Clothes review predictions:")
print("\nReview:",input_review)
features = extract_features(input_review.split())
probabilities = classifier.prob_classify(features)
predicted_sentiment = probabilities.max()
print("Predicted sentiment:", predicted_sentiment)
print("Probability:", round(probabilities.prob(predicted_sentiment), 2))
I cannnot believe how terrible is it!
Clothes review predictions:
Review: I cannnot believe how terrible is it!
Predicted sentiment: 1
Probability: 0.61