Natural language processing is a field you will almost always encounter in machine learning. The term sounds difficult, so it can be hard to pin down, yet it is actually indispensable to our daily lives. Moreover, even beginners can put it into practice if they prepare the necessary dataset and know how to design the program.
So this time we will build a program in Python that uses natural language processing. The task is sentiment analysis: based on a dataset of reviews of women's clothing, we will predict which rating a newly entered text corresponds to.
Before we start the actual coding, let's cover what natural language processing is and what sentiment analysis is.
To explain what natural language processing is, we first need to clarify the difference between two kinds of languages: natural languages and programming languages. The difference lies in how many possible meanings and interpretations a sentence allows.
An example of natural language is "a girl with big black eyes". This phrase alone allows at least two interpretations. One groups the words as "big", "black eyes", and "girl", that is, a girl whose black eyes are big. The other groups them as "black eyes" and "big girl", that is, a tall girl with black eyes.
As you can see, natural language contains ambiguity.
An example of a programming language expression is "4 * 6 + 1". This can only be interpreted as multiplying 4 by 6 and then adding 1, so there is no room for any other reading.
Programming languages are unambiguous in this way because a computer must always interpret and execute the same statement in the same manner.
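As a quick check, here is a minimal sketch (runnable in any Python interpreter) showing that the expression has exactly one evaluation order, fixed by operator precedence:

```python
# Operator precedence fixes the interpretation: multiplication binds before addition.
print(4 * 6 + 1)    # 25 -- always (4 * 6) + 1
print((4 * 6) + 1)  # 25 -- parentheses just make the same order explicit
print(4 * (6 + 1))  # 28 -- a different expression, not a different "reading"
```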
Natural language processing is a technology for practically handling huge amounts of text data in spite of the ambiguity of natural language. Within it, sentiment analysis quantifies the emotional elements contained in text. It can be used for many kinds of feedback, such as analyzing product reviews on the Internet.
This time, we will perform natural language processing based on "Womens Clothing E-Commerce Reviews.csv", a Kaggle dataset that collects reviews of women's clothing.
What we use for sentiment analysis is the numerical rating given to the clothes in each review. The ratings run in five steps from 1 to 5. We will analyze which words are used at each rating and finally make it possible to predict which rating an entered sentence corresponds to.
Now that we have covered the dataset and the target of the sentiment analysis, let's walk through the creation process step by step.
First, load the necessary libraries and data.
import numpy as np
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
review = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
# Adjust the path to the CSV file as needed.
# read_csv already returns a DataFrame, so no further conversion is necessary.
Let's look at the role of each library.
numpy
numpy is a Python library for efficient numerical computation. In machine learning, a model is trained by repeatedly operating on multidimensional arrays such as vectors and matrices, and numpy performs those computations efficiently. That makes this library indispensable.
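As a minimal illustration (the matrix and vector values here are made up), numpy performs vectorized arithmetic on whole arrays without explicit Python loops:

```python
import numpy as np

# A hypothetical 2x2 weight matrix and an input vector, purely for illustration.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.5, -1.0])

print(W @ x)      # matrix-vector product: [-1.5 -2.5]
print(x * 2 + 1)  # elementwise arithmetic over the whole array at once: [ 2. -1.]
```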
pandas
pandas efficiently performs the work required for data analysis. In a data analysis project, the preprocessing that precedes machine learning is often said to account for 80% to 90% of the total effort: reading the data, filling in or removing missing values, and otherwise arranging the data so that machine learning can run properly. pandas provides the features needed for this, so you can work efficiently.
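For example, here is a minimal sketch of typical preprocessing steps, using a tiny made-up DataFrame rather than the real dataset:

```python
import pandas as pd

# A made-up DataFrame with one missing review text.
df = pd.DataFrame({"Text": ["great dress", None, "runs small"],
                   "Rating": [5, 4, 2]})

print(df.isnull().sum())  # count missing values per column
df = df.dropna()          # drop rows that contain missing values
print(df.head())          # inspect the first rows
```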
nltk
nltk is a platform for building Python programs that work with human language data. It provides tools for many kinds of processing, such as analyzing and classifying sentences.
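As a small sketch of the classifier interface used later in this article (the two training examples are made up), nltk's NaiveBayesClassifier trains on (feature dictionary, label) pairs:

```python
from nltk.classify import NaiveBayesClassifier

# Made-up training data: each example is a ({word: True, ...}, label) pair.
train = [({"love": True, "pretty": True}, "pos"),
         ({"terrible": True, "cheap": True}, "neg")]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({"love": True}))  # expected: 'pos'
```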
With each library's role covered, let's look at an overview of the data and check for missing values.
review.columns
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
'Recommended IND', 'Positive Feedback Count', 'Division Name',
'Department Name', 'Class Name'],
dtype='object')
review = review.rename(columns={"Unnamed: 0": "ID", 'Review Text': 'Text'})
# Rename "Unnamed: 0" to "ID" and 'Review Text' to 'Text'.
review = review[["ID", "Text", "Rating"]]
# Only the ID, text, and rating columns are used here.
# Check the number of missing values with isnull().sum().
review.isnull().sum()
ID 0
Text 845
Rating 0
dtype: int64
# There are 845 rows where a rating was given but no review text was written.
# Use dropna() to delete such rows.
review = review.dropna()
review.isnull().sum()
ID 0
Text 0
Rating 0
dtype: int64
# The rows with missing text have now been removed.
review.head()  # Display the first rows of the data with head().
| ID | Text | Rating |
|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | 4 |
| 1 | Love this dress! it's sooo pretty. i happene... | 5 |
| 2 | I had such high hopes for this dress and reall... | 3 |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 5 |
| 4 | This shirt is very flattering to all due to th... | 5 |
review.describe()  # Summarize the numerical columns with describe().
# There are 22,641 data points, and nearly half of the reviews have a rating of 5.

| | ID | Rating |
|---|---:|---:|
| count | 22641.000000 | 22641.000000 |
| mean | 11740.849035 | 4.183561 |
| std | 6781.957509 | 1.115762 |
| min | 0.000000 | 1.000000 |
| 25% | 5872.000000 | 4.000000 |
| 50% | 11733.000000 | 5.000000 |
| 75% | 17621.000000 | 5.000000 |
| max | 23485.000000 | 5.000000 |
review.dtypes  # Check the column types with dtypes; if a type is inconvenient for processing, it may need to be converted.
ID int64
Text object
Rating int64
dtype: object
Now that we have confirmed the data and removed the missing values, let's move on to training.
First, split the text of every review into words and store the resulting word dictionaries in a separate list for each rating.
rate_id_one = []
rate_id_two = []
rate_id_three = []
rate_id_four = []
rate_id_five = []

for text, rating in zip(review['Text'], review['Rating']):
    # Mark every word in the review as a present feature.
    dic = {word: True for word in text.split(" ")}
    if rating == 1:
        rate_id_one.append((dic, 1))
    elif rating == 2:
        rate_id_two.append((dic, 2))
    elif rating == 3:
        rate_id_three.append((dic, 3))
    elif rating == 4:
        rate_id_four.append((dic, 4))
    else:
        rate_id_five.append((dic, 5))
Now that the reviews are sorted by rating, we split each list 8:2 into training data and test data.
The totals are the sums of the five per-rating splits.
threshold = 0.8
num_one = int(threshold * len(rate_id_one))
num_two = int(threshold * len(rate_id_two))
num_three = int(threshold * len(rate_id_three))
num_four = int(threshold * len(rate_id_four))
num_five = int(threshold * len(rate_id_five))
features_train = rate_id_one[:num_one] + rate_id_two[:num_two] + rate_id_three[:num_three] + rate_id_four[:num_four] + rate_id_five[:num_five]
features_test = rate_id_one[num_one:] + rate_id_two[num_two:] + rate_id_three[num_three:] + rate_id_four[num_four:] + rate_id_five[num_five:]
print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))
Number of training datapoints: 18111
Number of test datapoints: 4530
With the data divided into a training set and a test set, let's train the classifier.
When the trained model judges the ratings of the test data, the accuracy turns out to be less than half.
classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.4640176600441501
One likely cause is that there are five possible ratings from 1 to 5, and that the absolute number of low-rated reviews is small. For example, restricting the choice to ratings 1 and 5 may improve accuracy; a sketch of that idea follows below.
Alternatively, other methods might improve accuracy further, which remains a task for the future.
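As an untested sketch of the first idea, the same classifier can be trained on ratings 1 and 5 only, reusing the rate_id_one, rate_id_five, num_one, and num_five variables defined above (the resulting accuracy is not verified here):

```python
# A minimal sketch, assuming rate_id_one, rate_id_five, num_one, and num_five
# from the code above are still in scope.
binary_train = rate_id_one[:num_one] + rate_id_five[:num_five]
binary_test = rate_id_one[num_one:] + rate_id_five[num_five:]

binary_classifier = NaiveBayesClassifier.train(binary_train)
print('Binary accuracy:', nltk_accuracy(binary_classifier, binary_test))
```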
Let's see which words most influenced the predicted ratings during training.
N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0])
Top 15 most informative words:
1. worst
2. shame
3. poorly
4. horrible
5. disappointment.
6. cheap.
7. strange.
8. sad.
9. dull
10. terrible.
11. returned.
12. terrible
13. awkward.
14. concept
15. awful.
Negative words such as "worst", "shame", and "disappointment" stand out.
Straightforward expressions of negativity appear to be the decisive factors that influence the rating.
Some words carry a trailing period because the text was split only on spaces; this time we treat each such token as its own word. A sketch of merging those tokens follows below.
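If you wanted to merge tokens such as "terrible" and "terrible.", one option (not used for the results in this article) is to strip surrounding punctuation and lowercase each token before building the feature dictionary:

```python
import string

def normalize(word):
    # Strip surrounding punctuation and lowercase, so 'Terrible.' -> 'terrible'.
    return word.strip(string.punctuation).lower()

tokens = "It was Terrible.".split(" ")
dic = {normalize(w): True for w in tokens if normalize(w)}
print(dic)  # {'it': True, 'was': True, 'terrible': True}
```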
Now let's write our own sentence and have the model predict its rating.
def extract_features(words):
    # Build the same {word: True} feature dictionary used for the review texts.
    return dict([(word, True) for word in words])

input_review = input()
print("Clothes review predictions:")
print("\nReview:", input_review)

features = extract_features(input_review.split())
probabilities = classifier.prob_classify(features)
predicted_sentiment = probabilities.max()
print("Predicted sentiment:", predicted_sentiment)
print("Probability:", round(probabilities.prob(predicted_sentiment), 2))
# Output the rating the input text most likely corresponds to, with its probability.
For example, enter "I cannnot believe how terrible is it!" here (the typo and broken grammar are left as a user might actually type them).
I cannnot believe how terrible is it!
Clothes review predictions:
Review: I cannnot believe how terrible is it!
Predicted sentiment: 1
Probability: 0.61
The sentence is judged most likely to correspond to the lowest rating.
This time, after covering what natural language processing and sentiment analysis are, we implemented natural language processing using a Kaggle dataset.
Even a programming beginner can implement this, provided they obtain the necessary data and follow the appropriate steps. The steps we actually took were:

1. Load the libraries and data.
2. Check the data, process it, and prepare it for training.
3. Train the model and check its performance.

If you understand this general setup and can actually write and run the code, you can apply it to other datasets, so it is worth mastering and using on your own data.
Finally, here is the complete code for reference.
import numpy as np
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
review = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
# Adjust the path to the CSV file as needed.
# read_csv already returns a DataFrame, so no further conversion is necessary.
review.columns
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
'Recommended IND', 'Positive Feedback Count', 'Division Name',
'Department Name', 'Class Name'],
dtype='object')
review = review.rename(columns={"Unnamed: 0": "ID", 'Review Text': 'Text'})
review = review[["ID", "Text", "Rating"]]  # Only the ID, text, and rating columns are used here.
review.isnull().sum()
ID 0
Text 845
Rating 0
dtype: int64
review = review.dropna()
review.isnull().sum()
ID 0
Text 0
Rating 0
dtype: int64
review.head()  # Display the first rows of the data
review.describe()  # Summarize the numerical columns
review.dtypes
ID int64
Text object
Rating int64
dtype: object
rate_id_one = []
rate_id_two = []
rate_id_three = []
rate_id_four = []
rate_id_five = []

for text, rating in zip(review['Text'], review['Rating']):
    # Mark every word in the review as a present feature.
    dic = {word: True for word in text.split(" ")}
    if rating == 1:
        rate_id_one.append((dic, 1))
    elif rating == 2:
        rate_id_two.append((dic, 2))
    elif rating == 3:
        rate_id_three.append((dic, 3))
    elif rating == 4:
        rate_id_four.append((dic, 4))
    else:
        rate_id_five.append((dic, 5))
rate_id_one[0]  # Inspect the first (word dictionary, rating) pair in the list
len(rate_id_one)
821
threshold = 0.8
num_one = int(threshold * len(rate_id_one))
num_two = int(threshold * len(rate_id_two))
num_three = int(threshold * len(rate_id_three))
num_four = int(threshold * len(rate_id_four))
num_five = int(threshold * len(rate_id_five))
features_train = rate_id_one[:num_one] + rate_id_two[:num_two] + rate_id_three[:num_three] + rate_id_four[:num_four] + rate_id_five[:num_five]
features_test = rate_id_one[num_one:] + rate_id_two[num_two:] + rate_id_three[num_three:] + rate_id_four[num_four:] + rate_id_five[num_five:]
print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))
Number of training datapoints: 18111
Number of test datapoints: 4530
classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.4640176600441501
N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
print(str(i+1) + '. ' + item[0])
Top 15 most informative words:
1. worst
2. shame
3. poorly
4. horrible
5. disappointment.
6. cheap.
7. strange.
8. sad.
9. dull
10. terrible.
11. returned.
12. terrible
13. awkward.
14. concept
15. awful.
def extract_features(words):
    # Build the same {word: True} feature dictionary used for the review texts.
    return dict([(word, True) for word in words])

# Classify a sentence typed on the spot.
input_review = input()
print("Clothes review predictions:")
print("\nReview:",input_review)
features = extract_features(input_review.split())
probabilities = classifier.prob_classify(features)
predicted_sentiment = probabilities.max()
print("Predicted sentiment:", predicted_sentiment)
print("Probability:", round(probabilities.prob(predicted_sentiment), 2))
I cannnot believe how terrible is it!
Clothes review predictions:
Review: I cannnot believe how terrible is it!
Predicted sentiment: 1
Probability: 0.61