[PYTHON] [Implementation explanation] Classify livedoor news by Japanese BERT x unsupervised learning (information amount maximization clustering): Google Colaboratory (PyTorch)

In this article, I implement and explain unsupervised clustering of livedoor news by information-maximization (IIC), using the Japanese version of BERT in Google Colaboratory.

In the articles of this series so far, I explained the following topics, so please have a look at them first:

- How to vectorize sentences with the Japanese version of BERT in Google Colaboratory
- Information-maximizing clustering (unsupervised learning) on MNIST

** Series list **
[1] [Implementation explanation] How to use Japanese version BERT with Google Colaboratory (PyTorch)
[2] [Implementation explanation] Livedoor news classification in Japanese version BERT: Google Colaboratory (PyTorch)
[3] [Implementation explanation] Brain science and unsupervised learning. Classify MNIST by information amount maximization clustering
[4] * This article [Implementation explanation] Classify livedoor news by Japanese BERT x unsupervised learning (information amount maximization clustering)


Last time, following the paper Invariant Information Clustering for Unsupervised Image Classification and Segmentation, I carried out unsupervised clustering of MNIST handwritten digit images with IIC (Invariant Information Clustering), which is based on mutual information.

This time, we will see what happens when each livedoor news article is vectorized with Japanese BERT and then clustered with IIC (unsupervised learning).

With MNIST, the clustering result lined up nicely with the digit labels used in supervised learning, which suggested it could even serve as a substitute for supervised labels. But what about text data? Let's find out.

The flow of this implementation is as follows:

  1. Download livedoor news and convert it to PyTorch's DataLoader
  2. Vectorize livedoor news articles with BERT
  3. Prepare the IIC deep learning model
  4. Train the IIC network
  5. Infer test data

The implementation code of this post is placed in the following GitHub repository.

GitHub: How to use Japanese version of BERT in Google Colaboratory: Implementation code. The notebook for this article is 4_BERT_livedoor_news_IIC_on_Google_Colaboratory.ipynb.

1. Download livedoor news and convert it to PyTorch's DataLoader

The steps up to this point are exactly the same as in [2] [Implementation explanation] Livedoor news classification in Japanese version BERT: Google Colaboratory (PyTorch), so I omit them in this article.

Finally, create a training, validation, and test DataLoader by doing the following:

# Create DataLoaders (simply called Iterators in the context of torchtext)
batch_size = 16  # Batch sizes of around 16 or 32 are typical for BERT

dl_train = torchtext.data.Iterator(
    dataset_train, batch_size=batch_size, train=True)

dl_eval = torchtext.data.Iterator(
    dataset_eval, batch_size=batch_size, train=False, sort=False)

dl_test = torchtext.data.Iterator(
    dataset_test, batch_size=batch_size, train=False, sort=False)

#Put together in a dictionary object
dataloaders_dict = {"train": dl_train, "val": dl_eval}

2. Vectorize livedoor news articles with BERT

Vectorize the article bodies of livedoor news with the Japanese version of BERT. In other words, document vectorization.

This time, I simply treat the 768-dimensional embedding of BERT's first token ([CLS]) as the document vector (there are various ways to create document vectors, and debates about how valid each is).

Vectorizing the documents again and again while training the IIC clustering network would waste time, so we create a DataLoader whose contents have already been converted to BERT document vectors.

The implementation is as follows:

First, prepare the BERT model itself. We use the pretrained Japanese parameters published by Tohoku University.

from transformers.modeling_bert import BertModel

# BERT model with pretrained Japanese parameters
model = BertModel.from_pretrained('bert-base-japanese-whole-word-masking')
model.eval()
print('Network setting completed')

Next, define a function to vectorize with BERT.

# Define a function to vectorize with BERT


def vectorize_with_bert(net, dataloader):

    # Check whether a GPU can be used
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print("Device used:", device)
    print('-----start-------')

    # Move the network to the GPU
    net.to(device)

    # Speed things up when the network is more or less fixed
    torch.backends.cudnn.benchmark = True

    # Mini-batch size
    batch_size = dataloader.batch_size

    # Loop that retrieves mini-batches from the DataLoader
    for index, batch in enumerate(dataloader):
        # batch holds Text and Label
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        inputs = batch.Text[0].to(device)  # sentences
        labels = batch.Label.to(device)  # labels

        # Forward pass
        with torch.set_grad_enabled(False):

            # Feed the input into BERT
            result = net(inputs)

            # Extract the vector of the first token from sequence_output
            vec_0 = result[0]  # result[0] is sequence_output
            vec_0 = vec_0[:, 0, :]  # all batches, all 768 elements of the 0th token ([CLS])
            vec_0 = vec_0.view(-1, 768)  # reshape to [batch_size, hidden_size]

            # Accumulate the vectorized data into torch tensors
            if index == 0:
                list_text = vec_0
                list_label = labels
            else:
                list_text = torch.cat([list_text, vec_0], dim=0)
                list_label = torch.cat([list_label, labels], dim=0)

    return list_text, list_label
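As an aside on the "various ways to create a document vector" mentioned above, a mean-pooling variant of the extraction step inside this function might look like the sketch below. It assumes that result[0] is BERT's sequence_output and that the padding token id is 0; it is only an illustration and is not used in the rest of this article.

# Hedged sketch: mean pooling over non-padding tokens instead of taking the [CLS] vector
# (assumes result[0] is sequence_output and that the padding token id is 0)
mask = (inputs != 0).unsqueeze(-1).float()           # [batch_size, seq_len, 1]
sum_vec = (result[0] * mask).sum(dim=1)              # sum token vectors, ignoring padding
mean_vec = sum_vec / mask.sum(dim=1).clamp(min=1.0)  # [batch_size, 768] mean-pooled document vector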

Vectorize the training, validation, and test DataLoaders with the defined functions.

#Convert DataLoader to vectorized version
#It takes a little less than 5 minutes

list_text_train, list_label_train = vectorize_with_bert(model, dl_train)
list_text_eval, list_label_eval = vectorize_with_bert(model, dl_eval)
list_text_test, list_label_test = vectorize_with_bert(model, dl_test)

Then turn these tensors into a PyTorch Dataset.

#Convert torch list to Dataset

from torch.utils.data import TensorDataset

dataset_bert_train = TensorDataset(
    list_label_train.view(-1, 1), list_text_train)
dataset_bert_eval = TensorDataset(list_label_eval.view(-1, 1), list_text_eval)
dataset_bert_test = TensorDataset(list_label_test.view(-1, 1), list_text_test)

Finally, set the Dataset to DataLoader. This DataLoader is used for network learning of IIC (Invariant Information Clustering).

#Make it a Dataloader
from torch.utils.data import DataLoader

batch_size = 1024

dl_bert_train = DataLoader(
    dataset_bert_train, batch_size=batch_size, shuffle=True, drop_last=True)
# drop_last: discard the last mini-batch if it is smaller than batch_size

dl_bert_eval = DataLoader(
    dataset_bert_eval, batch_size=batch_size, shuffle=False)
dl_bert_test = DataLoader(
    dataset_bert_test, batch_size=batch_size, shuffle=False)

With the above, the text of livedoor news has been vectorized in Japanese BERT.

All you have to do now is cluster this vectorized data.

3. Prepare the IIC deep learning model

Next, prepare the IIC deep learning model. This part has basically the same configuration as [Implementation explanation] Brain science and unsupervised learning. Classify MNIST by information amount maximization clustering.

The network must accept a 768-dimensional vector as input. This time, the features are transformed by repeatedly applying 1D convolutions.

At the same time, overclustering with 10 times the number of clusters we actually expect is also performed, which trains the network to capture fine-grained features as well.

import torch.nn as nn
import torch.nn.functional as F

OVER_CLUSTRING_RATE = 10


class NetIIC(nn.Module):
    def __init__(self):
        super(NetIIC, self).__init__()

        # We do not use multi-head this time
        self.conv1 = nn.Conv1d(1, 400, kernel_size=768, stride=1, padding=0)
        self.bn1 = nn.BatchNorm1d(400)
        self.conv2 = nn.Conv1d(1, 300, kernel_size=400, stride=1, padding=0)
        self.bn2 = nn.BatchNorm1d(300)
        self.conv3 = nn.Conv1d(1, 300, kernel_size=300, stride=1, padding=0)
        self.bn3 = nn.BatchNorm1d(300)

        self.fc1 = nn.Linear(300, 250)
        self.bnfc1 = nn.BatchNorm1d(250)

        # Expecting the 9 categories of livedoor news; will the clusters correspond to them?
        self.fc2 = nn.Linear(250, 9)

        # overclustering
        # By clustering into more clusters than actually assumed, the network can capture finer variations
        self.fc2_overclustering = nn.Linear(250, 9*OVER_CLUSTRING_RATE)

    def forward(self, x):
        x = x.view(x.size(0), 1, -1)
        x = F.relu(self.bn1(self.conv1(x)))

        x = x.view(x.size(0), 1, -1)
        x = F.relu(self.bn2(self.conv2(x)))

        x = x.view(x.size(0), 1, -1)
        x = F.relu(self.bn3(self.conv3(x)))

        x = x.view(x.size(0), -1)
        x_prefinal = F.relu(self.bnfc1(self.fc1(x)))

        # Do not use multi-head
        y = F.softmax(self.fc2(x_prefinal), dim=1)
        y_overclustering = F.softmax(self.fc2_overclustering(
            x_prefinal), dim=1)  # overclustering

        return y, y_overclustering
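As a quick sanity check of the network shapes (a minimal usage sketch, separate from the actual training flow), you can push a dummy batch of 768-dimensional vectors through NetIIC and confirm that it returns 9 ordinary cluster probabilities and 9 x 10 overclustering probabilities per document:

# Minimal shape check of NetIIC with a dummy batch
net_check = NetIIC()
dummy = torch.randn(4, 768)  # 4 documents, 768-dimensional BERT vectors
y, y_overclustering = net_check(dummy)
print(y.shape)                 # torch.Size([4, 9])
print(y_overclustering.shape)  # torch.Size([4, 90])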

Define an initialization function for the model's weight parameters.

import torch.nn.init as init


def weight_init(m):
    """Weight initialization"""
    if isinstance(m, nn.Conv1d):
        init.normal_(m.weight.data)
        if m.bias is not None:
            init.normal_(m.bias.data)
    elif isinstance(m, nn.BatchNorm1d):
        init.normal_(m.weight.data, mean=1, std=0.02)
        init.constant_(m.bias.data, 0)
    elif isinstance(m, nn.Linear):
        # Xavier
        # init.xavier_normal_(m.weight.data)

        # He
        init.kaiming_normal_(m.weight.data)

        if m.bias is not None:
            init.normal_(m.bias.data)

Define the calculation of the IID loss function.

We compute the mutual information between the output (x_out) obtained by feeding the vectorized data into NetIIC and the output (x_tf_out) obtained by feeding the transformed version of that data into NetIIC.

See the previous article for more details on the implementation here.

[Implementation explanation] Brain science and unsupervised learning. Classify MNIST by information amount maximization clustering
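Concretely, writing the joint distribution of the two cluster-assignment outputs as $P_{ij}$ (computed by compute_joint below), with marginals $P_i$ and $P_j$, the quantity to be maximized is the mutual information

$$
I(z, z') = \sum_{i=1}^{k} \sum_{j=1}^{k} P_{ij} \ln \frac{P_{ij}}{P_i \, P_j}
$$

IID_loss below returns its negative, with the marginal terms weighted by a coefficient alpha (alpha = 1 recovers the plain mutual information).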

# Definition of the IID loss function used by IIC
#Reference: https://github.com/RuABraun/phone-clustering/blob/master/mnist_basic.py
import sys


def compute_joint(x_out, x_tf_out):
    bn, k = x_out.size()
    assert (x_tf_out.size(0) == bn and x_tf_out.size(1) == k), '{} {} {} {}'.format(
        bn, k, x_tf_out.size(0), x_tf_out.size(1))

    p_i_j = x_out.unsqueeze(2) * x_tf_out.unsqueeze(1)  # bn, k, k
    p_i_j = p_i_j.sum(dim=0)  # k, k
    p_i_j = (p_i_j + p_i_j.t()) / 2.  # symmetrise
    p_i_j = p_i_j / p_i_j.sum()  # normalise
    return p_i_j


def IID_loss(x_out, x_tf_out, EPS=sys.float_info.epsilon):
    # has had softmax applied
    bs, k = x_out.size()
    p_i_j = compute_joint(x_out, x_tf_out)
    assert (p_i_j.size() == (k, k))

    p_i = p_i_j.sum(dim=1).view(k, 1).expand(k, k)
    p_j = p_i_j.sum(dim=0).view(1, k).expand(k, k)

    # avoid NaN losses. Effect will get cancelled out by p_i_j tiny anyway
    # The commented-out lines below raise an error in PyTorch 1.3 and later
    # https://discuss.pytorch.org/t/pytorch-1-3-showing-an-error-perhaps-for-loss-computed-from-paired-outputs/68790/3
    #p_i_j[(p_i_j < EPS).data] = EPS
    #p_j[(p_j < EPS).data] = EPS
    #p_i[(p_i < EPS).data] = EPS

    p_i_j = torch.where(p_i_j < EPS, torch.tensor(
        [EPS], device=p_i_j.device), p_i_j)
    p_j = torch.where(p_j < EPS, torch.tensor([EPS], device=p_j.device), p_j)
    p_i = torch.where(p_i < EPS, torch.tensor([EPS], device=p_i.device), p_i)

    # https://qiita.com/Amanokawa/items/0aa24bc396dd88fb7d2a
    # Add the weight alpha, following the reference above

    alpha = 2.0
    loss = (- p_i_j * (torch.log(p_i_j) - alpha *
                       torch.log(p_j) - alpha*torch.log(p_i))).sum()

    return loss
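As a quick sanity check (a usage sketch only), feeding IID_loss two identical softmax outputs should give a lower (better) loss than feeding it two unrelated random ones, because identical assignments carry more mutual information:

# Illustrative check of IID_loss
probs = F.softmax(torch.randn(128, 9), dim=1)        # random cluster assignments
probs_other = F.softmax(torch.randn(128, 9), dim=1)  # unrelated assignments
print(IID_loss(probs, probs).item())        # identical pair -> lower loss
print(IID_loss(probs, probs_other).item())  # unrelated pair -> higher loss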

Next, define the transformation to be applied to the vectorized data of interest.

This conversion function is the key to IIC.

For image data, the usual data-augmentation operations are used: affine transformations (rotation / stretching), aspect-ratio changes, cropping, and noise addition.

What should we do with text data?

In data augmentation for competitions such as Kaggle, one common approach is back-translation: translating the text into another language and then back again.

This time, we will simply add noise scaled by the standard deviation of all the vectorized data.

#Definition of a function that adds noise to data
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tensor_std = list_text_train.std(dim=0).to(device)


def perturb_data(x):
    y = x.clone()
    noise = torch.randn(len(tensor_std)).to(device)*tensor_std*2.0
    noise = noise.expand(x.shape[0], -1)
    y += noise

    return y
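Before training, it can help to check the perturbation on one mini-batch (a usage sketch, assuming dl_bert_train created above):

# Illustrative check of perturb_data on one mini-batch
labels_tmp, vecs_tmp = next(iter(dl_bert_train))   # (labels, 768-dim BERT vectors)
vecs_perturbed = perturb_data(vecs_tmp)
print(vecs_tmp.shape, vecs_perturbed.shape)        # both [batch_size, 768]
print((vecs_perturbed - vecs_tmp).abs().mean())    # average magnitude of the added noise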

4. Train the IIC network

Now that the DataLoader and the IIC model are ready, let's train the IIC model's weights.

First, define the training function.

#Definition of learning function


def train(total_epoch, model, train_loader, optimizer, device):

    #Put network in training mode
    model.train()

    # Learning rate scheduler: cosine annealing with warm restarts
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=2, T_mult=2, eta_min=0)

    for epoch in range(total_epoch):
        for batch_idx, (target, data) in enumerate(train_loader):

            # Update the learning rate
            scheduler.step()

            data_perturb = perturb_data(data)  # Add noise to create the transformed data

            # Send the data to the GPU if available
            data = data.to(device)
            data_perturb = data_perturb.to(device)

            # Reset the optimizer's gradients
            optimizer.zero_grad()

            # Feed the data into the neural network
            output, output_overclustering = model(data)
            output_perturb, output_perturb_overclustering = model(data_perturb)

            #Loss calculation
            loss1 = IID_loss(output, output_perturb)
            loss2 = IID_loss(output_overclustering,
                             output_perturb_overclustering)
            loss = loss1 + loss2

            # Backpropagate and update the parameters to reduce the loss
            loss.backward()
            optimizer.step()

        #Log output
        if epoch % 50 == 0:
            print('Train Epoch {} \tLoss1: {:.6f} \tLoss2: {:.6f} \tLoss_total: {:.6f}'.format(
                epoch, loss1.item(), loss2.item(), loss1.item()+loss2.item()))

    return model, optimizer

In the training above, the learning rate is varied with the CosineAnnealingWarmRestarts scheduler.

As shown below, this scheduler gradually decreases the learning rate and then suddenly raises it again; this device makes it easier for the parameters to escape local minima and move toward the global minimum.

Figure (sgdr.jpg): learning-rate schedule with warm restarts. Source: https://www.kaggle.com/c/imet-2019-fgvc6/discussion/94783
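To see the warm-restart behavior concretely, here is a minimal sketch (with a dummy parameter and optimizer, separate from the actual training) that prints how the learning rate evolves under the same scheduler settings:

# Minimal sketch: trace the learning rate of CosineAnnealingWarmRestarts (T_0=2, T_mult=2)
dummy_param = [torch.nn.Parameter(torch.zeros(1))]
opt_tmp = torch.optim.Adam(dummy_param, lr=5e-4)
sched_tmp = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    opt_tmp, T_0=2, T_mult=2, eta_min=0)

for step in range(15):
    print(step, opt_tmp.param_groups[0]['lr'])  # the lr decays, then jumps back up at each restart
    opt_tmp.step()
    sched_tmp.step()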

Now we run the training. This time the batch size is 1,024 while the dataset has only about 4,000 samples, so we increase the number of epochs instead.

Training takes less than 5 minutes.

# Run the training (takes less than 5 minutes)

# Build the IIC network, initialize its weights, and move it to the device
net = NetIIC()
net.apply(weight_init)
net.to(device)

total_epoch = 1000
optimizer = torch.optim.Adam(net.parameters(), lr=5e-4)  # optimizer

model_trained, optimizer = train(
    total_epoch, net, dl_bert_train, optimizer, device)

Now that training is complete, let's run inference on the test data. To make the results easier to inspect, we prepare a test DataLoader with mini-batch size 1, infer one sample at a time, and store the results.

The function is defined below.

#Check the results of model classification clusters
import numpy as np

# Prepare a test DataLoader with mini-batch size 1
dl_bert_test = DataLoader(
    dataset_bert_test, batch_size=1, shuffle=False)


def test(model, device, test_loader):
    model.eval()

    out_targs = []
    ref_targs = []

    #Prepare a list for output
    total_num = len(test_loader)
    # index, (target_label, inferenced_label)
    output_list = np.zeros((total_num, 2))

    with torch.no_grad():
        for batch_idx, (target, data) in enumerate(test_loader):
            data = data.to(device)
            target = target.to(device)
            outputs, outputs_overclustering = model(data)

            #Add classification results to list
            out_targs.append(outputs.argmax(dim=1).cpu())
            ref_targs.append(target.cpu())

            #Put the results in a list
            output_list[batch_idx, 0] = target.cpu()  #Correct label
            output_list[batch_idx, 1] = outputs.argmax(dim=1).cpu()  #Forecast label

    out_targs = torch.cat(out_targs)
    ref_targs = torch.cat(ref_targs)

    return out_targs.view(-1, 1).numpy(), ref_targs.numpy(), output_list

5. Infer test data

Perform inference with test data.

#Infer with test data

out_targs, ref_targs, output_list = test(model_trained, device, dl_bert_test)

Check the confusion matrix (frequency table) of the inference results on the test data.

#Make a confusion matrix
matrix = np.zeros((9, 9))

# Frequency table: rows are the correct livedoor news classes, columns are the estimated clusters
for i in range(len(out_targs)):
    row = ref_targs[i]
    col = out_targs[i]
    matrix[row][col] += 1

np.set_printoptions(suppress=True)
print(matrix)

The output result is as follows.

[[ 55.   0.   1.   0.   4.  47.   2.  76.   0.]
 [  3.  40.   4.   0.  14.   1.   1.   0. 116.]
 [  7.  39.  21.   4.  16.   3.   6.   3.   1.]
 [ 11.  60.  16.   4.  13.   8.  27.   2.  17.]
 [  8.   6.  20. 107.   1.   8.  16.   0.   1.]
 [ 11.  17.  15.   0.  40.   6.  78.   7.   0.]
 [ 18.   3.  65.  40.  13.  15.  14.   3.   1.]
 [ 63.   7.  45.  11.   2.  42.   7.   1.   1.]
 [ 27.   0.   6.   0.   4.  61.   1.  61.   1.]]

The vertical axis follows the order below, as confirmed by checking the contents of dic_id2cat:

{0: 'sports-watch',
 1: 'dokujo-tsushin',
 2: 'livedoor-homme',
 3: 'peachy',
 4: 'smax',
 5: 'movie-enter',
 6: 'it-life-hack',
 7: 'kaden-channel',
 8: 'topic-news'}

First, unlike MNIST, the estimated clusters do not line up neatly with the correct classes. I think that is only natural.

If humans were told to "classify a large number of documents into several types," they too would cluster them in various different ways. Asked to divide livedoor news into 9 groups, the model happened to produce this particular result.
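If you want a single number for how well the clusters align with the correct classes, one option is the best one-to-one matching between clusters and classes on the confusion matrix above. The following is a minimal sketch, assuming scipy is available:

# Minimal sketch: accuracy under the best cluster-to-class assignment (Hungarian algorithm)
from scipy.optimize import linear_sum_assignment

row_ind, col_ind = linear_sum_assignment(-matrix)  # maximize the matched frequencies
best_match_acc = matrix[row_ind, col_ind].sum() / matrix.sum()
print(best_match_acc)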

Estimated cluster 2 is mostly "it-life-hack" and "kaden-channel". Estimated cluster 3 is mostly "smax" (smartphones and gadgets). The last estimated cluster contains mostly "dokujo-tsushin" articles.

The second-to-last cluster (7) is mostly "sports-watch" and "topic-news". The "sports-watch" and "topic-news" classes are mainly split across estimated clusters 0, 5, and 7.

By looking at the "sports-watch" texts that fell into clusters 5 and 7, and the "topic-news" texts that fell into clusters 5 and 7, let's examine what characterizes cluster 5 and cluster 7.

# Check the contents of the clusters
# Texts of "sports-watch" that fell into cluster 5 and cluster 7
# Texts of "topic-news" that fell into cluster 5 and cluster 7
# Let's look at the features of cluster 5 and cluster 7

import pandas as pd

df2 = pd.DataFrame(output_list)
df2.columns=["Correct answer class", "Estimated cluster"]
df2.head()
df2[(df2['Correct answer class']==0) & (df2['Estimated cluster']==5)].head()

The output is as follows.

	Correct answer class	Estimated cluster
21	0.0	5.0
59	0.0	5.0
126	0.0	5.0
142	0.0	5.0
153	0.0	5.0

Let's take a look at the 21st and 59th documents in the test dataset.

#The original documents are stored in df. Show the first 300 characters or so.
print(df.iloc[21, 0][:300])
print(df.iloc[59, 0][:300])

The first "sports-watch" text that fell into cluster 5 is:

On the 29th of last month, after retiring as a professional wrestler, Kotetsu Yamamoto, who was loved by fans as a player training, commentator, and referee, died suddenly due to hypoxic encephalopathy, giving deep sadness to fans and related parties. At that time, "Weekly Asahi Geino" released on the 7th of this week reported an amazing action in the "NEWS SHOT!" Corner, which shows Mr. Yamamoto's "immediately before his death" heroism. Former editor-in-chief of weekly professional wrestling, Tarzan Yamamoto, commented on the magazine, saying, "The body, which is 170 cm tall and weighs 113 kg, still maintains muscles that are reminiscent of the active era. It is as harsh as young people. I trained and had an unbelievable appetite when I was 68 years old, but in fact Mr. Yamamoto was diabetic, "he said.

Similarly, the first "sports-watch" text that fell into cluster 7 is:

Interview with former Chunichi Dragons director Hiromitsu Ochiai, where shocking words such as "everything went crazy", "I have never touched the pitcher", and "no one trusts me" popped out one after another. .. Baseball commentator Suguru Egawa was the listener on Nippon Television's "Going! Sports & News," and the pattern was broadcast over two nights (17th and 18th). In the first broadcast, Mr. Ochiai revealed an unknown episode, saying, "If I wear a uniform next year, I will not talk so far," but the second part of the broadcast is even more surprising. It became the content that should be. Below is a summary of the interview. Egawa: Did you think you could be the best in Japan this year? I didn't think

Next, the first "topic-news" text that fell into cluster 5 is:

In South Korea, it is a general rule that monks are single, but the monk's Facebook, which says, "I was desperate and decided to go home without being popular from the opposite sex," has become a hot topic on Korean online bulletin boards. The monk, named Hyobon, said on Facebook on the 19th, "I was in my twenties and I wasn't popular with the opposite sex. Get out of the house. If you think about it now, you should have rapped and punked when you were young, "he said. Regarding the life of a monk, he said, "The sooner you go home, the faster you get bigger. If you get bigger, you don't have to worry about rice and laundry."

And the first "topic-news" text that fell into cluster 7 is:

On the 18th, the Nihon Keizai Shimbun reported the distorted reality of the disaster area in an article entitled "Another Abnormal Situation in the Disaster Area: Special Demand for Reconstruction / Compensation for Nuclear Power Plants ... Money Inflow Distorted Regeneration" and caused controversy on the online bulletin board. ing. In the same article, he pointed out that if the family to be compensated is a family of five, 800,000 yen a month will be in the pocket, "I got money from TEPCO, pachinko, sushi, smartphone and bag new even if I do not work. The words of the local people, "I'm doing it," are posted. The compensation that flows to the victims and how to use it is positioned as "distorted regeneration". On the online bulletin board, "I didn't donate at all, I didn't have a blind spot."

Honestly, it is hard to grasp the defining characteristics of each cluster just by reading the texts...

However, you can see that the sports articles and the news articles that ended up in the same cluster have a very similar atmosphere.

To get a better idea of the characteristics of the IIC clusters, operations such as the following could be considered (a rough sketch of the second idea is shown below):

- Look at word-frequency trends with a word cloud
- Combine (for example, average) the document vectors of each cluster into a cluster vector, pick a representative document for the cluster, and list similar documents and words
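As a rough sketch of the second idea (assuming list_text_test and output_list from above, whose rows are aligned one to one), one could average the document vectors of an estimated cluster and pick the document closest to that average as the cluster's representative:

# Rough sketch: representative document of an estimated cluster (here cluster 5)
import torch.nn.functional as F

cluster_id = 5
vecs = list_text_test.cpu()                              # [num_docs, 768] BERT document vectors
assigned = torch.tensor(output_list[:, 1])               # estimated cluster of each document
cluster_vecs = vecs[assigned == cluster_id]
cluster_center = cluster_vecs.mean(dim=0, keepdim=True)  # cluster vector (simple average)

# cosine similarity between the cluster vector and every document in the cluster
sims = F.cosine_similarity(cluster_vecs, cluster_center.expand_as(cluster_vecs), dim=1)
rep_index_in_cluster = sims.argmax().item()  # index within the cluster subset
print(rep_index_in_cluster, sims.max().item())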

A full analysis along these lines would make this article too long, so that's it for this time.

The implementation code of this post is placed in the following GitHub repository.

GitHub: How to use Japanese version of BERT in Google Colaboratory: Implementation code. The notebook for this article is 4_BERT_livedoor_news_IIC_on_Google_Colaboratory.ipynb.

That concludes this implementation example and explanation of unsupervised IIC clustering of livedoor news vectorized with Japanese BERT.

Thank you for reading.


[Remarks] The AI Technology Department development team that I lead is looking for members. Click here if you are interested

[Disclaimer] The content of this article itself is the opinion / transmission of the author, not the official opinion of the company to which the author belongs.

