[Python] 100 Language Processing Knocks for people who don't understand the problem statements: Chapter 8, Machine Learning

I have solved Chapter 4 (Morphological Analysis) and Chapter 5 (Dependency Analysis) of the 100 Language Processing Knocks with Python 3. For now I don't need the contents of "Chapter 6: Processing English Text" or "Chapter 7: Database", so I will skip them and move on to "Chapter 8: Machine Learning".

As for my background: I have never used Python at work, I have taken one introductory course on Coursera, and I am a complete amateur at natural language processing and machine learning, but I plan to start using them at work.

From this chapter on, I can no longer follow what the problem statements are saying, so I will also write down explanations of the terms as I go.

Data used in this chapter

- About 10,000 English movie reviews, roughly 5,000 positive and 5,000 negative.

What you are doing in this chapter

- Build a model that predicts whether each review is positive or negative from the review text.
- Supervised learning.
- Logistic regression.
- Since the chapter is long, I explain it in five stages: 1. Feature design => 2. Learning => 3. Validation => 4. Learning and validation by cross-validation => 5. Observing how precision and recall change with the threshold.

Referenced site

- JAIST Language Information Processing Glossary
- Negative/Positive Thinking Document Classification Memo
- Kitanozaka memorandum: 100 Language Processing Knock 2015 edition (73)
- Gihyo.jp "Let's get started with machine learning", Part 18: Logistic Regression

0. Preparation (70. Obtaining and shaping data)

In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from Movie Review Data published by Bo Pang and Lillian Lee is used, and we work on the task of classifying sentences as positive or negative (polarity analysis).

Using the correct-answer data of the sentence polarity analysis, create the correct-answer data file (sentiment.txt) as follows.

Add the string "+1" to the beginning of each line in rt-polarity.pos (polarity label "+1" followed by a space followed by positive statement content) Add the string "-1" to the beginning of each line in rt-polarity.neg (polarity label "-1" followed by a space followed by a negative statement) Concatenate the contents of 1 and 2 above and sort the lines randomly After creating sentiment.txt, check the number of positive examples (positive sentences) and the number of negative examples (negative sentences).

Confirmation of terms

- **Polarity**: When a linguistic expression such as a sentence, phrase, or word carries a positive or negative meaning, that "positive" or "negative" is called the polarity of the expression. Automatically determining polarity is a basic technique for reputation information processing.

code

ml/Main.py


from ml.Sentiments import Sentiments, Util
from ml.Model import LogisticRegression
import matplotlib.pyplot as plt

sentiments = Sentiments()
sentiments.restore_from_pos_neg_file('../8_ML/rt-polarity.pos', '../8_ML/rt-polarity.neg')
sentiments.shuffle()
sentiments.save('../8_ML/70_sentiment.txt')
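
Problem 70 also asks to check the number of positive and negative examples after creating sentiment.txt; that code is not shown in the original article. A minimal check of my own over the list just built:

print(sum(1 for s in sentiments.all if s['label'] == 1))   # number of positive examples
print(sum(1 for s in sentiments.all if s['label'] == -1))  # number of negative examples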

ml/Sentiments.py


import random
import re
from itertools import chain
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words
from ml.Model import LogisticRegression

class Sentiments:
    """
Class that manages the list of reviews
    """

    def __init__(self) -> None:
        self.all = []

    def restore_from_pos_neg_file(self, pos_file: str, neg_file: str) -> None:
        """
Keep positive and negative sentences as a list with polar labels.
As a polar label for positive text'1'As a polar label for negative sentences'-1'Put on.
        :param pos_file A file that saves positive text(latin_1)
        :param neg_file File containing negative text(latin_1)
        """
        with open(pos_file, encoding='latin_1') as input_file:
            self.all += [{'label': 1, 'sentence': line.replace('\n', '')} for line in input_file]

        with open(neg_file, encoding='latin_1') as input_file:
            self.all += [{'label': -1, 'sentence': line.replace('\n', '')} for line in input_file]

    def shuffle(self):
        random.shuffle(self.all)
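
Main.py above calls sentiments.save(...), but the save method is not shown in the original article. A minimal sketch, assuming it writes the same tab-separated fields that the restore method (shown in section 4) reads back, with empty fields for values that have not been computed yet:

    def save(self, file: str) -> None:
        # Sketch (not from the original article): one review per line as
        # label <TAB> sentence <TAB> space-joined features <TAB> probability <TAB> predicted label
        with open(file, mode='w', encoding='utf-8') as output:
            for sentiment in self.all:
                output.write('{}\t{}\t{}\t{}\t{}\n'.format(
                    sentiment['label'],
                    sentiment['sentence'],
                    ' '.join(sentiment.get('features', [])),
                    sentiment.get('probability', ''),
                    sentiment.get('predict_label', '')))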

1. Feature design (71. Stopword / 72. Feature extraction)

- 71. Create a suitable list of English stop words (a stop list). Then implement a function that returns true if a word (string) given as an argument is in the stop list and false otherwise. Also write a test for that function.

- 72. Design features that are likely to be useful for polarity analysis and extract them from the training data. As a minimal baseline, the features would be the review text with stop words removed and each word stemmed.

Confirmation of terms

- **Stop words**: A list of words that should be removed from the index terms in information retrieval. It consists of function words and words with very general meanings, such as "be" and "have", that are not considered useful for retrieval. Since the number of stop words is limited, the list is often created manually in advance. The instruction to "create the list appropriately" confused me a little, but a quick look at how other people solved it shows approaches such as writing the list by hand, building it with morphological analysis plus frequency analysis, and hard-coding a generic list found on the net. Here I will use a Python package called stop-words.

- **Feature**: Information that can be used as a clue when classifying data in machine learning. For example, when training a model that estimates the part of speech of a word, the words and parts of speech that appear before and after the target word are used as features; in other words, the model learns to estimate the part of speech of the target word from its surrounding words and parts of speech. What is used as a feature is an important factor in the success or failure of machine-learning-based natural language processing.
- **Training data**: Data used to automatically learn a classification model in machine learning. Here it is the review data (the sentiments list) saved in sentiment.txt.
- **Stemming**: The stem (word base) is the part of a word that does not change under inflection; for example, the stem of "throws" is "throw", and stemming converts "throws" into "throw". In Python this can be done with the stem method of the PorterStemmer class in the nltk.stem.porter package (it came up in "52. Stemming"); a short check of the library is sketched after this list.
- **Designing features**: Each person designs features that seem useful for polarity analysis... what should I do? On top of the minimum baseline in the problem statement (stop-word removal plus stemming), I tried removing words whose occurrence counts were too high (10,000 or more) or too low (3 or fewer), on the assumption that they would not affect polarity, but that did not improve the results, so I will proceed with the minimum baseline here.
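
A quick check of the two libraries mentioned above (my own example; it assumes the nltk and stop-words packages are installed):

from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words

stemmer = PorterStemmer()
print(stemmer.stem('throws'))              # -> 'throw'
print('the' in get_stop_words('english'))  # -> True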

code

We add a feature list to each of the reviews held by sentiments.

ml/Main.py


sentiments.add_features()
sentiments.save('../8_ML/72_sentiment.txt')

ml/Sentiments.py


class Sentiments:

    # Methods introduced above
    def __init__(self) -> None: ...
    def restore_from_pos_neg_file(self, pos_file: str, neg_file: str) -> None: ...

    # Methods added below for 1. Feature design (71. Stop words / 72. Feature extraction)

    def add_features(self):
        stemmer = PorterStemmer()

        # Add feature information to each example in the training data
        for sentiment in self.all:
            words = re.compile(r'[,.:;\s]').split(sentiment['sentence'])
            features = []

            for word in words:
                stem = stemmer.stem(word)
                # if not (stop_words.is_stop_word(stem) or is_freq_word(stem) or stem == ''):
                if not (Util.is_stop_word(stem) or stem == ''):
                    features.append(stem)

            sentiment['features'] = list(set(features))

class Util:
    stop_words = get_stop_words('english')

    @staticmethod
    def is_stop_word(_word: str) -> bool:
        """
Returns true if the word (string) given in the argument is included in the stoplist, false otherwise
        :param _word word word
        :True if the word (string) given in the return bool argument is included in the stoplist, false otherwise
        """
        return _word in Util.stop_words

2. Learning (73. Learning / 75. Feature weights)

- 73. Learn a logistic regression model using the features extracted in 72.
- 75. Check the 10 features with the largest weights and the 10 features with the smallest weights in the logistic regression model learned in 73.

Confirmation of terms

- **Learning a logistic regression model**: I don't understand this well enough to explain the exact definition, but for now I will take it as "from the correspondence between the labels and the features created in 72, compute the 'weight vector' using the discriminant function (sigmoid function)".

- **Weight vector**: A collection of values expressing how each feature affects the result (represented here in Python as a dict). For example, running the program below yields a weights dict such as {..., 'perfect': 1.829376624682014, 'remark': 1.8879018752394459, 'dull': -2.8891666516371806, 'bore': -3.153237996082115, ...}. This can be interpreted as: a review containing the word (stem) "perfect" or "remark" is likely to be positive, while one containing "dull" or "bore" is likely to be negative. These numbers look intuitively reasonable.

- **Discriminant function (sigmoid function)**: A function that takes the weight vector and a feature list as input and predicts the probability that the review is positive (a formula sketch follows this list). I will not go into the details of the calculation here.

- **Learning rate**: A positive value that controls how much the parameters move in a single update. The higher the learning rate, the faster the learning, but the less stable the predicted probabilities. It seems common to start around 0.1 and gradually decrease it as learning progresses; here, after some trial and error, I set the initial value to 0.6.
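
For reference, what the code in Model.py below computes can be summarized as follows (my own restatement, not part of the original problem statement). For a review with feature set F, label y converted to a probability (-1 => 0, +1 => 1), and learning rate η:

$$ p = \sigma\Big(\sum_{f \in F} w_f\Big) = \frac{1}{1 + e^{-\sum_{f \in F} w_f}}, \qquad w_f \leftarrow w_f - \eta\,(p - y) \quad \text{for each } f \in F $$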

code

ml/Main.py


model = LogisticRegression()
model.calc_weights(sentiments=sentiments.all)

ml/Model.py


import math
from collections import defaultdict


class LogisticRegression:
    def __init__(self):
        self.weights = defaultdict(float)  #Weight vector

    def predict(self, _features: list) -> float:
        """
Discriminating function:Predict the possibility of a positive review using the weight vector and feature list as inputs
        :param _features A list of features organized by review text
        :return Probability of a positive review
        """
        #Inner product of weight vector and feature list
        x = sum([self.weights[feature] for feature in _features])

        #Sigmoid function
        return 1.0 / (1.0 + math.exp(-x))

    def update(self, _features: list, _label: int, _eta: float) -> None:
        """
Update weight vector.
        :param _features A list of features organized by review text
        :param _label Label attached to the review text(Positive review:+1 /Negative review:-1)
        :param _eta learning rate
        """
        # Answer predicted by the discriminant function (probability of being a positive review)
        predict_answer = self.predict(_features)

        # Whether it is actually a positive review (convert the label to a probability: -1 => 0.0, 1 => 1.0)
        actual_answer = (float(_label) + 1) / 2

        # Update the weight vector
        for _feature in _features:
            _dif = _eta * (predict_answer - actual_answer)

            # Skip the update if the resulting weight would be too close to 0
            if 0.0002 > abs(self.weights[_feature] - _dif):
                continue

            self.weights[_feature] -= _dif

    def calc_weights(self, eta0: float = 0.6, etan: float = 0.9999, sentiments: list = None) -> None:
        """
Calculate weight vector
        :param eta0 Initial learning rate
        :param etan Learning rate reduction rate
        :param sentiments List of dictionaries containing review labels, sentences, and feature lists
        """
        for idx, sentiment in enumerate(sentiments):
            self.update(sentiment['features'], sentiment['label'], eta0 * (etan ** idx))

    def save_weights(self, file_name: str) -> None:
        """
Write the weight vector to a file (sorted by weight)
        :param file_name file name
        """
        with open(file_name, mode='w', encoding='utf-8') as output:
            for k, v in sorted(self.weights.items(), key=lambda x: x[1]):
                output.write('{}\t{}\n'.format(k, v))

    def restore_weights(self, file_name: str) -> dict:
        """
Restore weight vector from file
        :param file_name File name where the weight vector is stored
        :return weight vector
        """
        weights = {}
        with open(file_name, encoding='utf-8') as input_file:
            for line in input_file:
                key, value = line.split()
                weights[key] = float(value)

        self.weights = weights
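
The original article does not show code for problem 75 (inspecting the features with the largest and smallest weights). A minimal sketch of my own, assuming it is run in Main.py after model.calc_weights(...):

# Problem 75 (sketch): 10 features with the largest weights and 10 with the smallest.
sorted_weights = sorted(model.weights.items(), key=lambda x: x[1])
print(sorted_weights[-10:][::-1])  # top 10 (largest weights)
print(sorted_weights[:10])         # bottom 10 (smallest weights)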

3. Validation (74. Prediction / 76. Labeling / 77. Measuring the accuracy rate)

- 74. Using the logistic regression model learned in 73, implement a program that computes the polarity label of a given sentence ("+1" for a positive example, "-1" for a negative example) and its prediction probability.
- 76. Apply the logistic regression model to the training data and output the correct label, predicted label, and prediction probability in tab-separated format.
- 77. Write a program that takes the output of 76 and calculates the prediction accuracy rate, the precision for positive examples, the recall, and the F1 score.

Confirmation of terms

- Precision (for positive examples): the number of examples correctly predicted as positive / the number of examples predicted as positive.
- Recall: the number of positive examples correctly predicted as positive / the number of actual positive examples.
- F1 score: the harmonic mean of precision and recall = (2 × precision × recall) / (precision + recall). (The formulas are also written out below.)
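
Written as formulas (my own restatement of the definitions above), with TP = true positives, FP = false positives, and FN = false negatives:

$$ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$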

code

ml/Main.py


sentiments.add_predict(model.predict)
score = sentiments.calc_score()
Util.print_score(score, '77. Measurement of the accuracy rate')

ml/Sentiments.py


class Sentiments:

    # Methods introduced above
    def __init__(self) -> None: ...
    def restore_from_pos_neg_file(self, pos_file: str, neg_file: str) -> None: ...
    def add_features(self) -> None: ...

    # Methods added below
    def add_predict(self, predict_func: classmethod, threshold: float = 0.0):
        for sentiment in self.all:
            probability = predict_func(sentiment['features'])
            sentiment['probability'] = probability
            if probability > 0.5 + threshold:
                sentiment['predict_label'] = 1
            elif probability < 0.5 - threshold:
                sentiment['predict_label'] = -1
            else:
                sentiment['predict_label'] = 0

    def calc_score(self) -> dict:
        count = 0
        correct_count = 0
        actual_positive_count = 0
        predict_positive_count = 0
        correct_positive_count = 0
        for sentiment in self.all:
            count += 1
            correct = int(sentiment['label']) == int(sentiment['predict_label'])
            positive = int(sentiment['label']) == 1
            predict_positive = int(sentiment['predict_label']) == 1
            if correct:
                correct_count += 1
            if positive:
                actual_positive_count += 1
            if predict_positive:
                predict_positive_count += 1
            if correct and predict_positive:
                correct_positive_count += 1

        precision_rate = correct_positive_count / predict_positive_count
        recall_rate = correct_positive_count / actual_positive_count
        f_value = (2 * precision_rate * recall_rate) / (precision_rate + recall_rate)

        return {
            'correct_rate': correct_count / count,
            'precision_rate': precision_rate,
            'recall_rate': recall_rate,
            'f_value': f_value
        }


class Util:
    # Methods introduced above
    def is_stop_word(_word: str) -> bool: ...

    # Methods added below
    @staticmethod
    def print_score(score: dict, title: str = '') -> None:
        print('\n{}\n\tAccuracy rate: {}\n\tPrecision (positive examples): {}\n\tRecall: {}\n\tF1 score: {}'.format(
            title, score['correct_rate'], score['precision_rate'], score['recall_rate'], score['f_value']))

Output result

The predictions do not seem to be too far off.

77. Measurement of the accuracy rate
Accuracy rate: 0.8743200150065654
Precision (positive examples): 0.8564029290944811
Recall: 0.8994560120052523
F1 score: 0.8774016468435498

4. Learning and validation by 5-fold cross-validation (78. 5-fold cross-validation)

In the experiments of 76-77, the examples used for training were also used for evaluation, so this cannot be called a valid evaluation. That is, it evaluates the classifier's performance when it has memorized the training examples, and it does not measure the generalization performance of the model. Therefore, compute the accuracy rate, precision, recall, and F1 score for polarity classification using 5-fold cross-validation.

Confirmation of terms

- **5-fold cross-validation**: A method that splits the training data into 5 parts, uses 4 of them for training and the remaining 1 for testing, and builds and evaluates the model 5 times with different combinations.

code

ml/Main.py


sentiments.restore('../8_ML/72_sentiment.txt')  #Restore unlearned sentiments.
score = sentiments.cross_validation(model=LogisticRegression())
Util.print_score(score, '78. 5-fold cross-validation')

ml/Sentiments.py


class Sentiments:

    # Methods introduced above
    def __init__(self) -> None: ...
    def restore_from_pos_neg_file(self, pos_file: str, neg_file: str) -> None: ...
    def add_features(self) -> None: ...
    def add_predict(self, predict_func: classmethod, threshold: float = 0.0) -> None: ...
    def calc_score(self) -> dict: ...

    # Methods added below
    def restore(self, file: str):
        _sentiments = []
        with open(file, encoding='utf-8') as input_file:
            for line in input_file:
                _label, _sentence, _features_str, _probability, _predict_label = line.split('\t')
                _sentiments.append({
                    'label': int(_label),
                    'sentence': _sentence,
                    'features': _features_str.split(' '),
                    'probability': 0 if _probability == '' else float(_probability),
                    'predict_label': 0 if _predict_label.rstrip() == '' else float(_predict_label)
                })

        self.all = _sentiments

    def cross_validation(self, _divide_count: int = 5, model: LogisticRegression = None, threshold: float = 0.0) -> dict:
        divided_list = Util.divide_list(self.all, _divide_count)
        _scores = []

        for i in range(_divide_count):
            #Learning
            learning_data = list(chain.from_iterable([divided_list[x] for x in [_i for _i in range(_divide_count) if _i != i]]))
            model.calc_weights(sentiments=learning_data)

            #test
            self.all = divided_list[i]
            self.add_predict(model.predict, threshold)
            _scores.append(self.calc_score())

        return {
            'correct_rate': sum([_score['correct_rate'] for _score in _scores]) / _divide_count,
            'precision_rate': sum([_score['precision_rate'] for _score in _scores]) / _divide_count,
            'recall_rate': sum([_score['recall_rate'] for _score in _scores]) / _divide_count,
            'f_value': sum([_score['f_value'] for _score in _scores]) / _divide_count
        }

class Util:
    # Methods introduced above
    def is_stop_word(_word: str) -> bool: ...
    def print_score(score: dict, title: str = '') -> None: ...

    # Methods added below
    @staticmethod
    def divide_list(lst: list, count: int) -> list:
        """
Split the list into the specified number of parts
        :param lst: the list to split
        :param count: the number of parts to split into
        :return: the list of sublists
        """
        divided_list = []
        list_len = len(lst) / count

        for _i in range(count):
            begin_index = int(_i * list_len)
            end_index = int((_i + 1) * list_len if _i + 1 < count else len(lst))
            divided_list.append(lst[begin_index:end_index])

        return divided_list
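
A quick check of how Util.divide_list splits a list (a usage example of my own, not in the original article):

# Splitting 10 items into 3 parts: the last part absorbs the remainder.
print(Util.divide_list(list(range(10)), 3))  # -> [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]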

Output result

As expected, the scores are lower than in 77, but the drop is not that large.

78. 5-fold cross-validation
Accuracy rate: 0.848988423671968
Precision (positive examples): 0.8481575029900081
Recall: 0.852642297684391
F1 score: 0.8503632552717463

5. Observe how precision and recall change with the threshold (79. Drawing a precision-recall graph)

Draw a precision-recall graph by changing the classification threshold of the logistic regression model.

Confirmation of terms

- **Threshold**: So far, we have decided that the predicted label is positive if the probability is 0.5 or more and negative otherwise; this decision criterion is called the threshold. (In the add_predict code above, the threshold widens an "undecided" band around 0.5: a review is labeled +1 only if its probability exceeds 0.5 + threshold, and -1 only if it falls below 0.5 - threshold.)

code

Change the threshold from 0.0 to 0.45 in 0.05 increments and observe how precision and recall change.

ml/Main.py


precision_rates = []
recall_rates = []
thresholds = [t / 20 for t in range(10)]
for threshold in thresholds:
    sentiments.restore('../8_ML/72_sentiment.txt')  #Restore unlearned sentiments.
    score = sentiments.cross_validation(model=LogisticRegression(), threshold=threshold)
    precision_rates.append(score['precision_rate'])
    recall_rates.append(score['recall_rate'])
print(thresholds)
print(precision_rates)
print(recall_rates)

plt.plot(thresholds, precision_rates, label="precision", color="red")
plt.plot(thresholds, recall_rates, label="recall", color="blue")

plt.xlabel("threshold")
plt.ylabel("rate")
plt.xlim(-0.05, 0.5)
plt.ylim(0, 1)
plt.title("Logistic Regression")
plt.legend(loc=3)

plt.show()

Output result

Raising the threshold means making a prediction only when the model is quite confident, so the higher the threshold, the higher the precision. Conversely, the recall decreases, because the model assigns a label only in those confident cases.

figure_1.png
