[PYTHON] Predict short-lived works of Weekly Shonen Jump by machine learning (Part 2: Learning and evaluation)

1. Introduction

This article is a continuation of Predict short-lived works of Weekly Shonen Jump by machine learning (Part 1: Data analysis). Using the data acquired in Part 1, we implement and evaluate a multilayer perceptron classifier. Hereafter, "Jump" refers to Weekly Shonen Jump.

(Figure: result.png — evaluation results)

The figure above shows part of the evaluation results. With the best model (Filtered + Augmented), **given the publication order[^order] up to week 7 and the number of color pages, works that will end within 20 weeks can be predicted with about 65% accuracy**[^jump]. The 100 most recent works registered in the Japan Media Arts Database were used for evaluation, and the other works were used for training and hyperparameter tuning. I tried various ideas, but this performance was the limit of my ability. The details are explained below. The Jupyter notebook is here, and the source code is here.

This article does not express an opinion on Jump's editorial policy, and it does not call for any work to be cancelled or continued. Good luck, Jump! Good luck, manga artists!

[^order]: The Jump editorial department appears to have denied that the reader questionnaire is the supreme deciding factor, saying, "We do not necessarily consider only the results of the reader questionnaire" ("Jump" editorial department denies rumors of questionnaire supremacy... readers have mixed feelings).

[^jump]: As mentioned above, in reality the Jump editorial department decides which works to cancel in consideration of various factors. Please take this article as the daydream of a Jump fan.

2. Environment construction

2.1 Anaconda

In [anaconda](https://www.continuum.io/Downloads), create the following virtual environment `comic`:

conda create -n comic python=3.5
source activate comic
conda install pandas matplotlib jupyter notebook scipy scikit-learn seaborn scrapy
pip install tensorflow

The yml file is here. tensorflow and scikit-learn are included. seaborn is also included because pairplot() was used in Part 1.

2.2 Table of contents information

It is assumed that the `wj-api.json` obtained in Part 1 is in the `data` directory, and that the `ComicAnalyzer` class introduced in Part 1 is defined in `comic.py`.


import comic

wj = comic.ComicAnalyzer()

2.3 Modules

Since I want to display manga titles in Japanese, I configure matplotlib by referring to Draw Japanese with matplotlib on Ubuntu. If you are using something other than Ubuntu, adjust the font path below accordingly.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

import matplotlib
from matplotlib.font_manager import FontProperties
font_path = '/usr/share/fonts/truetype/takao-gothic/TakaoPGothic.ttf'
font_prop = FontProperties(fname=font_path)
matplotlib.rcParams['font.family'] = font_prop.get_name()

3. Model

3.1 Problem setting

In this article, we tackle the problem of classifying whether a work is short-lived or not based on the following input.

Input

As input, we use 8 dimensions in total: the publication order in each of the first 7 weeks of serialization, plus the number of color pages. The reason for using data only up to week 7 is that I wanted to predict even the shortest recent serializations (8 weeks) at least one week before they end. The reason for using the number of color pages in addition to the publication order is to improve prediction accuracy: intuitively, more popular works tend to get more color pages.
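As a rough illustration of this feature (this is the normalization performed by `get_x()` in Section 4.1; the 20-work issue below is a made-up example, assuming `best` counts from the front of the magazine and `worst` from the back):

```python
# A work printed 3rd from the front of a hypothetical 20-work issue is
# 18th from the back, so best + worst = 21 (the work itself is counted twice).
best, worst = 3, 18
print(best / (best + worst - 1))  # 0.15 -- near the front (popular) -> small value
```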

Short-lived works

In Part 1, a "short-lived work" was defined as follows:

> In this article, we use machine learning to predict short-lived works (works that end within 10 weeks).

As a preliminary experiment, I tried to classify short-lived works under this definition, but training did not go well. Analyzing `wj-api.json` again shows that very few works end within 10 weeks.

(Figure: cdf.png)

The left figure shows the cumulative distribution of serialization lengths over all works, and the right figure zooms in on the first 50 weeks. The horizontal axis is the serialization period, and the vertical axis is the cumulative percentage of works. The right figure shows that less than 10% of works ended within 10 weeks. As pointed out in Why neural networks can't beat SVM, multilayer perceptrons are not good at learning from unbalanced data[^svm].

According to Applying deep learning to real-world problems - Merantix, changing the labeling is one proposed countermeasure when data labels are skewed. Therefore, for convenience, this article changes the definition of a short-lived work to **a work that ends within 20 weeks** (predicting works that end within 10 weeks is left as future homework...). With the threshold at 20 weeks, about half of the works can be treated as short-lived.

[^svm]: It is then reasonable to point out that an SVM should be used instead. This time I stuck with the perceptron for study purposes.
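As a sanity check on these percentages, here is a minimal sketch, assuming the `wj` instance from Section 2.2 and the `end_titles`/`extract_item()` interface of `ComicAnalyzer` from Part 1:

```python
import numpy as np

# Serialization length in weeks of every completed work.
weeks = np.array([len(wj.extract_item(title)) for title in wj.end_titles])

for thresh in (10, 20):
    print('ended within {} weeks: {:.1%}'.format(thresh, np.mean(weeks <= thresh)))
```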

3.2 Multilayer Perceptron

The model of the multilayer perceptron used in this article is shown below. For more information on multilayer perceptrons, see Notes on the backpropagation method.

(Figure: model.png)

The hidden layers consist of 2 layers of 7 nodes each, with ReLU as their activation function. The output layer outputs the probability that a work is short-lived, with Sigmoid as its activation function. Adam is used for optimization, and the learning rate $ r $ is tuned with TensorBoard. Incidentally, this configuration (number of hidden layers, number of hidden nodes, hidden-layer activation function, and optimization algorithm) performed best in preliminary experiments.

3.3 Data set

This article uses 273 short-lived works and 273 other works (hereinafter, continuing works), 546 works in total. Counting from the newest work, 100 works are used as test data, the next 100 as validation data, and the remaining 346 as training data. The test data is for final evaluation, the validation data is for hyperparameter tuning, and the training data is for training. For details, see Why do we need to separate validation and test sets for supervised learning?.

The training data is used in the following three different ways. `x_test` and `y_test` denote the test data, `x_val` and `y_val` the validation data, and `x_tra` and `y_tra` the training data.

(Figure: dataset.png)

Dataset 1 uses all 346 training works for training. Dataset 2 excludes roughly the older half of the training works before training, because I suspected that some of the oldest works no longer reflect the current cancellation policy of the Jump editorial department (i.e., they act as noise). Dataset 3 inflates Dataset 2 by dataset augmentation before training, because I suspected that Dataset 2 alone contains too little training data for sufficient generalization performance.

Dataset augmentation is a technique that inflates the training data by transforming existing data. It is known to be effective mainly in image and speech recognition; for details, see Section 7.4 of the Deep Learning book and How to increase the number of machine learning dataset images. A hidden theme of this article is to evaluate how effective dataset augmentation is for predicting the cancellation of weekly serialized manga. Here, data augmentation is performed by the method shown below.

(Figure: aug.png)

Roughly speaking, new data is generated by randomly selecting two samples with the same label and taking a random weighted average of them. Behind this lies the assumption that a hypothetical work whose performance (publication order) lies between those of two short-lived works would itself be short-lived. Intuitively, that does not seem like a bad assumption.
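Here is a minimal numpy sketch of this idea with made-up feature vectors; the actual implementation is `augment_x()` in Section 4.1, which draws one random weight per feature dimension:

```python
import numpy as np

x_a = np.array([0.15, 0.20, 0.35])  # made-up features of short-lived work A
x_b = np.array([0.40, 0.55, 0.50])  # made-up features of short-lived work B
w = np.random.rand(len(x_a))        # one random weight per dimension, in [0, 1)
x_new = w * x_a + (1 - w) * x_b     # synthetic sample; it keeps the shared label
print(x_new)                        # lies element-wise between x_a and x_b
```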

4. Implementation

The class `ComicNet()` that manages the multilayer perceptron is defined below. `ComicNet()` sets up the various datasets (test, validation, and training), builds the multilayer perceptron, trains it, and tests it. TensorFlow is used for the implementation. For TensorFlow itself, I'm neither a programmer nor a data scientist, but I've touched Tensorflow for a month, so it's super easy to understand is a good reference.

ComicNet()


class ComicNet():
    """This class manages a multi-layer perceptron that identifies whether a manga work is short-lived or not.
    :param thresh_week: Threshold that separates short-lived works from others.
    :param n_x: Number of listing orders to enter in the Multilayer Perceptron.
    """
    def __init__(self, thresh_week=20, n_x=7):
        self.n_x = n_x
        self.thresh_week = thresh_week        

The following is a brief description of each member function.

4.1 Dataset setup: configure_dataset() etc.

ComicNet


    def get_x(self, analyzer, title):
        """Gets the normalized publication order of the given work up to
        week n_x, together with the average number of color pages."""
        worsts = np.array(analyzer.extract_item(title)[:self.n_x])
        bests = np.array(analyzer.extract_item(title, 'best')[:self.n_x])
        bests_normalized = bests / (worsts + bests - 1)
        color = sum(analyzer.extract_item(title, 'color')[:self.n_x]) / self.n_x
        return np.append(bests_normalized, color)

    def get_y(self, analyzer, title, thresh_week):
        """This is a function to get whether the specified work is a short-lived work."""
        return int(len(analyzer.extract_item(title)) <=  thresh_week)

    def get_xs_ys(self, analyzer, titles, thresh_week):
        """Returns the features, labels, and titles of the given works.
        The same number of y==0 and y==1 samples is returned.
        """
        xs = np.array([self.get_x(analyzer, title) for title in titles])
        ys = np.array([[self.get_y(analyzer, title, thresh_week)] 
                       for title in titles])
        
        # Balance the number of ys==0 and ys==1 samples.
        idx_ps = np.where(ys.reshape((-1)) == 1)[0]
        idx_ng = np.where(ys.reshape((-1)) == 0)[0]
        len_data = min(len(idx_ps), len(idx_ng))
        x_ps = xs[idx_ps[-len_data:]]
        x_ng = xs[idx_ng[-len_data:]]
        y_ps = ys[idx_ps[-len_data:]]
        y_ng = ys[idx_ng[-len_data:]]
        t_ps = [titles[ii] for ii in idx_ps[-len_data:]]
        t_ng = [titles[ii] for ii in idx_ng[-len_data:]]
        
        return x_ps, x_ng, y_ps, y_ng, t_ps, t_ng
        
    def augment_x(self, x, n_aug):
        """A function that artificially generates a specified number of x data."""
        if n_aug:
            x_pair = np.array(
                [[x[idx] for idx in 
                  np.random.choice(range(len(x)), 2, replace=False)]
                 for _ in range(n_aug)])
            weights = np.random.rand(n_aug, 1, self.n_x + 1)
            weights = np.concatenate((weights, 1 - weights), axis=1)
            x_aug = (x_pair * weights).sum(axis=1)
            
            return np.concatenate((x, x_aug), axis=0)
        else:
            return x
        
    def augment_y(self, y, n_aug):
        """A function that artificially generates a specified number of y data."""
        if n_aug:
            y_aug = np.ones((n_aug, 1)) if y[0, 0] \
                else np.zeros((n_aug, 1))
            return np.concatenate((y, y_aug), axis=0)
        else:
            return y
        
    def configure_dataset(self, analyzer, n_drop=0, n_aug=0):
        """A function that sets a dataset.
        :param analyzer:An instance of the ComicAnalyzer class
        :param n_drop:Number of old data to exclude from training data
        :param n_aug:Number of augmented data to add to training data
        """
        x_ps, x_ng, y_ps, y_ng, t_ps, t_ng = self.get_xs_ys(
            analyzer, analyzer.end_titles, self.thresh_week)
        self.x_test = np.concatenate((x_ps[-50:], x_ng[-50:]), axis=0)
        self.y_test = np.concatenate((y_ps[-50:], y_ng[-50:]), axis=0)
        self.titles_test = t_ps[-50:] + t_ng[-50:]
        self.x_val = np.concatenate((x_ps[-100 : -50], 
                                     x_ng[-100 : -50]), axis=0)
        self.y_val = np.concatenate((y_ps[-100 : -50], 
                                     y_ng[-100 : -50]), axis=0)
        self.x_tra = np.concatenate(
            (self.augment_x(x_ps[n_drop//2 : -100], n_aug//2), 
             self.augment_x(x_ng[n_drop//2 : -100], n_aug//2)), axis=0)
        self.y_tra = np.concatenate(
            (self.augment_y(y_ps[n_drop//2 : -100], n_aug//2), 
             self.augment_y(y_ng[n_drop//2 : -100], n_aug//2)), axis=0)

`configure_dataset()` first obtains the inputs (`x_ps`, `x_ng`), labels (`y_ps`, `y_ng`), and titles (`t_ps`, `t_ng`) with `get_xs_ys()`. Here the number of short-lived samples (`x_ps`, `y_ps`, `t_ps`) equals the number of continuing-work samples (`x_ng`, `y_ng`, `t_ng`). Of these, the newest 100 works become the test data, the next newest 100 works the validation data, and the remainder the training data. When setting up the training data, a total of `n_drop` of the oldest samples are excluded first, and then a total of `n_aug` augmented samples are added.
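As a minimal usage sketch (the `n_drop`/`n_aug` values correspond to Dataset 2 as configured in Section 5.1):

```python
wjnet = ComicNet()
wjnet.configure_dataset(wj, n_drop=173, n_aug=0)  # Dataset 2: drop the oldest works
print(wjnet.x_tra.shape)  # about half of the 346 training works, 8 features each
```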

4.2 Building the computation graph: build_graph()

ComicNet


    def build_graph(self, r=0.001, n_h=7, stddev=0.01):
        """A function that builds a multi-layer perceptron.
        :param r:Learning rate
        :param n_h:Number of nodes in the hidden layer
        :param stddev:Standard deviation of the initial distribution of variables
        """
        tf.reset_default_graph()
        
        #Input layer and target
        n_y = self.y_test.shape[1]
        self.x = tf.placeholder(tf.float32, [None, self.n_x + 1], name='x')
        self.y = tf.placeholder(tf.float32, [None, n_y], name='y')
        
        #Hidden layer (1st layer)
        self.w_h_1 = tf.Variable(
            tf.truncated_normal((self.n_x + 1, n_h), stddev=stddev))
        self.b_h_1 = tf.Variable(tf.zeros(n_h))
        self.logits = tf.add(tf.matmul(self.x, self.w_h_1), self.b_h_1)
        self.logits = tf.nn.relu(self.logits)
        
        #Hidden layer (second layer)
        self.w_h_2 = tf.Variable(
            tf.truncated_normal((n_h, n_h), stddev=stddev))
        self.b_h_2 = tf.Variable(tf.zeros(n_h))
        self.logits = tf.add(tf.matmul(self.logits, self.w_h_2), self.b_h_2)
        self.logits = tf.nn.relu(self.logits)
        
        #Output layer
        self.w_y = tf.Variable(
            tf.truncated_normal((n_h, n_y), stddev=stddev))
        self.b_y = tf.Variable(tf.zeros(n_y))
        self.logits = tf.add(tf.matmul(self.logits, self.w_y), self.b_y)
        tf.summary.histogram('logits', self.logits)
        
        #Loss function (named 'loss' so that test() can look it up by name)
        self.loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(
                logits=self.logits, labels=self.y), name='loss')
        tf.summary.scalar('loss', self.loss)
        
        #optimisation
        self.optimizer = tf.train.AdamOptimizer(r).minimize(self.loss)
        self.output = tf.nn.sigmoid(self.logits, name='output')
        correct_prediction = tf.equal(self.y, tf.round(self.output))
        self.acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32),
            name='acc')
        tf.summary.histogram('output', self.output)
        tf.summary.scalar('acc', self.acc)
        
        self.merged = tf.summary.merge_all()

In the input layer, `tf.placeholder` defines the input tensor (`x`) and the teacher-label tensor (`y`).

In the hidden layers, `tf.Variable` defines the weight tensors (`w_h_1`, `w_h_2`) and biases (`b_h_1`, `b_h_2`). Here, `tf.truncated_normal` gives the initial distribution of each Variable. `truncated_normal` is a normal distribution that discards values outside 2 sigma and is commonly used. In fact, the standard deviation of this `truncated_normal` is one of the important hyperparameters affecting model performance; based on the results of preliminary experiments, I set it to 0.01. `tf.add`, `tf.matmul`, and `tf.nn.relu` connect the tensors to form the hidden layers. Incidentally, if you rewrite `tf.nn.relu` as `tf.nn.sigmoid`, the activation function becomes Sigmoid. See here for the activation functions available in TensorFlow.
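For example, this is what the first hidden layer would look like with that one-line swap (a sketch; everything else unchanged):

```python
# Hidden layer (first layer), with Sigmoid instead of ReLU
self.logits = tf.add(tf.matmul(self.x, self.w_h_1), self.b_h_1)
self.logits = tf.nn.sigmoid(self.logits)  # was: tf.nn.relu(self.logits)
```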

The output layer basically performs the same processing as the hidden layers. Note, however, that no activation function needs to be applied in the output layer itself, because the loss function `tf.nn.sigmoid_cross_entropy_with_logits` applies the activation (Sigmoid) internally. By passing a tensor to `tf.summary.scalar`, its change over time can be checked with TensorBoard.

`tf.train.AdamOptimizer` is used as the optimization algorithm; see here for the optimization algorithms available in TensorFlow. The final output `output` (the Sigmoid of the logits) is rounded, i.e., thresholded at 0.5, and compared with the teacher label `y` to compute the accuracy `acc`. Finally, all log information is merged with [`tf.summary.merge_all`](https://www.tensorflow.org/api_docs/python/tf/summary/merge_all).
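A minimal numpy sketch of that thresholding step, with made-up logit values:

```python
import numpy as np

logits = np.array([-2.0, 0.3, 1.5])     # made-up final-layer outputs
output = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: probability of "short-lived"
pred = np.round(output)                 # rounding == thresholding at 0.5
print(output.round(2), pred)            # [0.12 0.57 0.82] [0. 1. 1.]
```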

4.3 Learning: train ()

In TensorFlow, training runs inside a `tf.Session`. The Variables must always be initialized first with `tf.global_variables_initializer()` (otherwise TensorFlow complains).

The model is trained by `sess.run(self.optimizer)`. Several fetches can be passed to `sess.run` as a tuple. Also, when calling `sess.run()`, values must be fed to the placeholders via a dictionary (`feed_dict`): `x_tra` and `y_tra` during training, and `x_val` and `y_val` during validation.

Log information for TensorBoard can be saved with `tf.summary.FileWriter`, and the trained model can be saved with `tf.train.Saver`.

ComicNet


    def train(self, epoch=2000, print_loss=False, save_log=False, 
              log_dir='./logs/1', log_name='', save_model=False,
              model_name='prediction_model'):
        """A function that trains a multi-layer perceptron and saves logs and trained models.
        :param epoch:Number of epochs
        :pram print_loss:Whether to output the history of the loss function
        :param save_log:Whether to save the log
        :param log_dir:Log storage directory
        :param log_name:Log save name
        :param save_model:Whether to save the trained model
        :param model_name:Conserved name of trained model
        """
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer()) #Variable initialization
            
            #Settings for saving logs
            log_tra = log_dir + '/tra/' + log_name 
            writer_tra = tf.summary.FileWriter(log_tra)
            log_val = log_dir + '/val/' + log_name
            writer_val = tf.summary.FileWriter(log_val)        

            for e in range(epoch):
                feed_dict = {self.x: self.x_tra, self.y: self.y_tra}
                _, loss_tra, acc_tra, mer_tra = sess.run(
                        (self.optimizer, self.loss, self.acc, self.merged), 
                        feed_dict=feed_dict)
                
                # validation
                feed_dict = {self.x: self.x_val, self.y: self.y_val}
                loss_val, acc_val, mer_val = sess.run(
                    (self.loss, self.acc, self.merged),
                    feed_dict=feed_dict)
                
                #Save log
                if save_log:
                    writer_tra.add_summary(mer_tra, e)
                    writer_val.add_summary(mer_val, e)
                
                #Loss function output
                if print_loss and e % 500 == 0:
                    print('# epoch {}: loss_tra = {}, loss_val = {}'.
                          format(e, str(loss_tra), str(loss_val)))
            
            #Save model
            if save_model:
                saver = tf.train.Saver()
                _ = saver.save(sess, './models/' + model_name)

4.4 Test: test ()

ComicNet


    def test(self, model_name='prediction_model'):
        """A function that reads and tests the specified model.
        :param model_name:The name of the model to load
        """
        tf.reset_default_graph()
        loaded_graph = tf.Graph()
        
        with tf.Session(graph=loaded_graph) as sess:
            
            #Model loading
            loader = tf.train.import_meta_graph(
                './models/{}.meta'.format(model_name))
            loader.restore(sess, './models/' + model_name)
            
            x_loaded = loaded_graph.get_tensor_by_name('x:0')
            y_loaded = loaded_graph.get_tensor_by_name('y:0')
            
            loss_loaded = loaded_graph.get_tensor_by_name('loss:0')
            acc_loaded = loaded_graph.get_tensor_by_name('acc:0')
            output_loaded = loaded_graph.get_tensor_by_name('output:0')
        
            # test
            feed_dict = {x_loaded: self.x_test, y_loaded: self.y_test}
            loss_test, acc_test, output_test = sess.run(
                (loss_loaded, acc_loaded, output_loaded), feed_dict=feed_dict)
            return acc_test, output_test

`test()` is a member function that tests the trained multilayer perceptron. `tf.train.import_meta_graph` loads the trained model, and the test data (`x_test`, `y_test`) is given via `feed_dict` to `sess.run`.

5. Experiment

5.1 Hyperparameter adjustment

By visualizing the accuracy (correct-answer rate) and loss (loss-function output) on the validation data with TensorBoard, the hyperparameters (learning rate $ r $ and number of epochs) are tuned. For more information on TensorBoard, see the official documentation. For simplicity, this article tunes $ r $ only to one significant digit. Although the details are omitted, the number of hidden layers (2), the hidden-layer activation function (ReLU), the standard deviation of the initial variable distribution (0.01), and the optimization algorithm (Adam) were roughly tuned in preliminary experiments.

rs = [n * 10 ** m for m in range(-4, -1) for n in range(1, 10)]
datasets = [
    {'n_drop':0, 'n_aug':0},
    {'n_drop':173, 'n_aug':0},
    {'n_drop':173, 'n_aug':173},
]

wjnet = ComicNet()

for i, dataset in enumerate(datasets):
    wjnet.configure_dataset(wj, n_drop=dataset['n_drop'], 
                            n_aug=dataset['n_aug'])
    log_dir = './logs/dataset={}/'.format(i + 1)
    for r in rs:
        log_name = str(r)
        wjnet.build_graph(r=r)
        wjnet.train(epoch=20000, save_log=True, log_dir=log_dir, 
                log_name=log_name)
        print('Saved log of dataset={}, r={}'.format(i + 1, r))

For Dataset 1, let's look at the accuracy and loss of validation data with TensorBoard.

tensorboard --logdir=./logs/dataset=1/val

(Figure: tensorboard.png)

The horizontal axis is the number of epochs. From these curves, we look for the $ r $ and epoch count that minimize the validation loss.

(Figure: dataset1.png)

For Dataset 1, $ r = 0.0003 $ and $ epoch = 2000 $ seem to be good. Do the same for Dataset 2 and Dataset 3.

(Figure: dataset2.png)

For Dataset 2, $ r = 0.0005 $ and $ epoch = 2000 $ seem to be good.

(Figure: dataset3.png)

For Dataset 3, $ r = 0.0001 $ and $ epoch = 8000 $ seem to be good.

5.2 Learning

For each Dataset, train with the hyperparameters adjusted above and save the model.

params = [
    {'n_drop':0, 'n_aug':0, 'r':0.0003, 
     'e': 2000, 'name':'1: Original'},
    {'n_drop':173, 'n_aug':0, 'r':0.0005, 
     'e': 2000, 'name':'2: Filtered'},
    {'n_drop':173, 'n_aug':173, 'r':0.0001, 
     'e': 8000, 'name':'3: Filtered+Augmented'}
]

wjnet = ComicNet()
for i, param in enumerate(params):
    model_name = str(i + 1)
    wjnet.configure_dataset(wj, n_drop=param['n_drop'],
                            n_aug=param['n_aug'])
    wjnet.build_graph(r=param['r'])
    wjnet.train(save_model=True, model_name=model_name, epoch=param['e'])
    print('Trained', param['name'])

5.3 Evaluation

Evaluate the performance with `ComicNet.test()`.

accs = []
outputs = []
for i, param in enumerate(params):
    model_name = str(i + 1)
    acc, output = wjnet.test(model_name)
    accs.append(acc)
    outputs.append(output)
    print('Test model={}: acc={}'.format(param['name'], acc))

plt.bar(range(3), accs, tick_label=[param['name'] for param in params])
for i, acc in enumerate(accs):
    plt.text(i - 0.1, acc-0.3, str(acc), color='w')
plt.ylabel('Accuracy') 

(Figure: result.png)

Since even random classification would give $ acc = 0.5 $, the result is underwhelming... Fortunately, the effects of filtering and augmentation could at least be confirmed.

5.4 Discussion

Let's dig a little deeper into the results of the best performing Model 3 (Filtered + Augmented).

idx_sorted = np.argsort(output.reshape((-1)))
output_sorted = np.sort(output.reshape((-1)))

y_sorted = np.array([wjnet.y_test[i, 0] for i in idx_sorted])
title_sorted = np.array([wjnet.titles_test[i] for i in idx_sorted])

t_ng = np.logical_and(y_sorted == 0, output_sorted < 0.5)
f_ng = np.logical_and(y_sorted == 1, output_sorted < 0.5)
t_ps = np.logical_and(y_sorted == 1, output_sorted >= 0.5)
f_ps = np.logical_and(y_sorted == 0, output_sorted >= 0.5)

weeks = np.array([len(wj.extract_item(title)) for title in title_sorted])
plt.plot(weeks[t_ng], output_sorted[t_ng], 'o', ms=10,
        alpha=0.5, c='b', label='True negative')
plt.plot(weeks[f_ng], output_sorted[f_ng], 'o', ms=10,
        alpha=0.5, c='r', label='False negative')
plt.plot(weeks[t_ps], output_sorted[t_ps], '*', ms=15,
        alpha=0.5, c='b', label='True positive')
plt.plot(weeks[f_ps], output_sorted[f_ps], '*', ms=15,
         alpha=0.5, c='r', label='False positive')
plt.ylabel('Output')
plt.xlabel('Serialized weeks')
plt.xscale('log')
plt.ylim(0, 1)
plt.legend()

(Figure: scatter.png)

The figure above shows the relationship between the actual serialization period and the output of the classifier. Blue marks correctly classified works (True), and red marks misclassified works (False). Stars are works classified as short-lived (Positive), and circles are works classified as continuing (Negative). The more blue works there are, and the more the distribution concentrates from the upper left to the lower right of the graph, the better the classification performance.

First, it is concerning that no output exceeds 0.75. Is the training not going well? I do not fully understand why... The next concern is the False positives in the upper right of the graph: some popular works serialized for more than 100 weeks are misclassified as short-lived. So let's compare the publication order (worst) of representative works from each classification result.

plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
for output, week, title in zip(
    output_sorted[t_ps][-5:], weeks[t_ps][-5:], title_sorted[t_ps][-5:]):
    plt.plot(range(1, 8), wj.extract_item(title)[:7], 
             label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.ylabel('Worst')
plt.ylim(0, 23)
plt.title('Part of True positive (correctly classified short-lived work)')
plt.legend()

plt.subplot(2, 2, 2)
for output, week, title in zip(
    output_sorted[f_ps], weeks[f_ps], title_sorted[f_ps]):
    if week > 100:
        plt.plot(range(1, 8), wj.extract_item(title)[:7], 
                 label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.ylim(0, 23)
plt.title('Part of False positive (continuation work misclassified as short-lived work)')
plt.legend()
    
plt.subplot(2, 2, 3)
for output, week, title in zip(
    output_sorted[f_ng][:5], weeks[f_ng][:5], title_sorted[f_ng][:5]):
    plt.plot(range(1, 8), wj.extract_item(title)[:7], 
             label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.xlabel('Weeks')
plt.ylabel('Worst')
plt.ylim(0, 23)
plt.title('Part of False negative (short-lived work misclassified as a continuation work)')
plt.legend()
    
plt.subplot(2, 2, 4)
for output, week, title in zip(
    output_sorted[t_ng][:5], weeks[t_ng][:5], title_sorted[t_ng][:5]):
    plt.plot(range(1, 8), wj.extract_item(title)[:7], 
             label='{0} ({1:>3}, {2:.2f})'.format(title[:5], week, output))
plt.xlabel('Weeks')
plt.ylim(0, 23)
plt.title('Part of a True Negative')
plt.legend()

(Figure: worsts.png)

The horizontal axis is the serialization week, and the vertical axis is the publication order counted from the back of the magazine. The legend shows the title (serialization period, output value). The False positive works (upper right) show a steeper downward trend in publication order up to week 7 than the True negative works (lower right). In other words, the False positive works can be regarded as popular works that turned around an early disadvantage. Also, the publication order of the False negative works (lower left) declines only gently up to week 7 and, at least to my eyes, is indistinguishable from that of the True negative works (lower right). The reasons for these misclassifications are understandable.

For reference, the output values of all 100 test works are plotted below.

labels = np.array(['{0} ({1:>3})'.format(title[:6], week)
                   for title, week in zip(title_sorted, weeks) ])

plt.figure(figsize=(4, 18))
plt.barh(np.arange(100)[t_ps], output_sorted[t_ps], color='b')
plt.barh(np.arange(100)[f_ps], output_sorted[f_ps], color='r')
plt.barh(np.arange(100)[f_ng], output_sorted[f_ng], color='r')
plt.barh(np.arange(100)[t_ng], output_sorted[t_ng], color='b')
plt.yticks(np.arange(100), labels)
plt.xlim(0, 1)
plt.xlabel('Output')
for i, out in enumerate(output_sorted):
    plt.text(out + .01, i - .5, '{0:.2f}'.format(out))

(Figure: output.png)

The horizontal axis is the output value, and the parentheses after each title show the serialization period. Blue indicates a correct classification and red an incorrect one. The closer the output value is to 1, the more strongly the work is judged to be short-lived.

6. Conclusion

Actually, this article is positioned as the output of what I learned in the Deep Learning Foundation Nanodegree[^nd101], which is why I started writing it. It is also why I stubbornly stuck to the multilayer perceptron. Applying machine learning to real-world problems really is hard. If it had not been for this theme, I would surely have given up.

The final performance was disappointing, but it was good to be able to confirm the effects of filtering and augmenting the dataset. Performance might improve a little more by tuning the hyperparameters (`n_drop`, `n_aug`) that were fixed arbitrarily this time. Alternatively, as pointed out in Why neural networks cannot beat SVM, other machine learning methods such as SVM could be applied. I'm exhausted, so I'll stop here.

Since Part 1 was released, I have received feedback from many people, both offline and online. Nothing makes a Sunday programmer happier. I look forward to working with you again. Thank you for reading to the end!

[^nd101]: I'm a so-called March-cohort student. Thank you.

References

In writing this article, I referred to the following. Thank you very much! :bow:

  1. Draw Japanese with matplotlib on Ubuntu: Japanese output in matplotlib
  2. Applying deep learning to real-world problems - Merantix: countermeasures for skewed data labels
  3. Notes on the backpropagation method: multilayer perceptrons in general
  4. Why do we need to separate validation and test sets for supervised learning?: handling of the various datasets
  5. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016: dataset augmentation in general (Section 7.4)
  6. How to increase the number of machine learning dataset images: dataset augmentation for image data
  7. I'm neither a programmer nor a data scientist, but I've touched Tensorflow for a month, so it's super easy to understand: TensorFlow
  8. TensorBoard: hyperparameter tuning with TensorBoard
  9. Why neural networks cannot beat SVM: future research direction
