This post is the day 25 article of the "Natural Language Processing Advent Calendar 2019 - Qiita".

This is siny.

This article summarizes how to build a negative/positive classifier using BERT, which plays a major role in natural language processing as of 2019.

Introduction

Knowledge about BERT is widely available in books, blogs, Qiita, and elsewhere. However, most datasets usable for natural language processing are in English and Japanese datasets are scarce, so I felt there were few examples of, and little information on, using BERT with Japanese text.

Currently, I think the following are major Japanese datasets that can be used for free.

- Aozora Bunko
- Twitter Japanese Reputation Analysis Dataset
- SNOW D18 Japanese Emotional Expression Dictionary
- livedoor news corpus

While looking for "**a free Japanese text dataset with a reasonable amount of data**", I found chABSA-dataset (chakki-works/chABSA-dataset), a Japanese dataset with about 2,800 entries.

chABSA-dataset is a Japanese dataset created from the securities reports of listed companies. Each sentence contains not only a negative/positive sentiment classification, but also information about "what" is negative or positive. The following is sample data from **chABSA-dataset**.

{
  "header": {
    "document_id": "E00008",
    "document_name": "Hokuto Corporation",
    "doc_text": "Securities report",
    "edi_id": "E00008",
    "security_code": "13790",
    "category33": "Fisheries / Agriculture and Forestry",
    "category17": "Food",
    "scale": "6"
  },
  "sentences": [
    {
      "sentence_id": 0,
      "sentence": "In the current consolidated fiscal year, the Japanese economy improved its corporate performance and employment / income environment due to the government's economic policies and the Bank of Japan's monetary easing measures....",
      "opinions": [
        {
          "target": "Japan economy",
          "category": "NULL#general",
          "polarity": "neutral",
          "from": 11,
          "to": 16
        },
        {
          "target": "Corporate performance",
          "category": "NULL#general",
          "polarity": "positive",
          "from": 38,
          "to": 42
        },...
      ],
    },
    {
      "sentence_id": 1,
      "sentence": "The environment surrounding the Group is such that consumers are suffering from sluggish real wages....",
      "opinions": [
        {
          "target": "Real wages",
          "category": "NULL#general",
          "polarity": "negative",
          "from": 15,
          "to": 19
        },...
      ]
    },...
  ]
}

I thought, "**chABSA-dataset** has a few thousand entries and includes values that express sentiment, so maybe it can be used for negative/positive classification?", so I tried building a BERT negative/positive classifier with this dataset.

All the implementation code explained in this article is in the GitHub repository below, so please clone it as needed. Each step is also described in **BERT model creation-learning-inference.ipynb** in the repository, so please refer to it as well.

「chABSA-dataset」(https://github.com/sinjorjob/chABSA-dataset)

Premise
Environment construction
BERT model schematic diagram of negative / positive classification
Creating a dataset for negative / positive learning
Implementation of Tokenizer for BERT
Create DataLoader
Implementation of negative / positive classification model by BERT
BERT fine tuning settings
BERT training / inference
Learning results
Input test text to visualize predictions and Attention
Display inference results and confusion matrices with large amounts of test data
Summary

1. Premise

In this article, we will create a negative-positive classifier based on the following assumptions.

item	meaning
OS	Ubuntu
BERT model	Fine-tuning is performed on the pytorch-pretrained-BERT model published by Kyoto University.
Morphological analysis	Juman++ (v2.0.0-rc2) or (v2.0.0-rc3)
Library	PyTorch

2. Environment construction

Build an environment where you can use the BERT Japanese Pretrained model with PyTorch.

Library installation


conda create -n pytorch python=3.6
conda activate pytorch
conda install pytorch=0.4 torchvision -c pytorch
conda install pytorch=0.4 torchvision cudatoolkit -c pytorch
conda install pandas jupyter matplotlib scipy scikit-learn pillow tqdm cython
pip install torchtext
pip install mojimoji
pip install attrdict
pip install pyknp

Anything that could not be installed with conda was installed with pip.

Juman++ installation

The BERT Japanese Pretrained model used here applies Juman++ (v2.0.0-rc2) for morphological analysis of the input text, so this article also uses **Juman++** as its morphological analysis tool. The installation procedure for Juman++ is summarized in a separate article; please refer to the following.

[**Summary of JUMAN++ installation procedure**] https://sinyblog.com/deaplearning/juman/

BERT Japanese Pretrained model preparation

The BERT Japanese Pretrained model can be downloaded from the following URL.

[BERT Japanese Pretrained model] http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB

Download **Japanese_L-12_H-768_A-12_E-30_BPE.zip** from the "Japanese_L-12_H-768_A-12_E-30_BPE.zip (1.6G)" link on the above page. The zip file contains several files, but only the following three are needed this time.

item	meaning
bert_config.json	Config file for BERT model
pytorch_model.bin	Model converted for the PyTorch version of BERT (pytorch-pretrained-BERT)
vocab.txt	BERT vocabulary dictionary data

The entire directory structure is as follows.

├─data
│  └─chABSA    #chABSA json file
│  └─test.tsv    #Test data
│  └─train.tsv    #Training data
│  └─test_dumy.tsv  #Dummy data
│  └─train_dumy.tsv #Dummy data

├─utils
│  └─bert.py    #BERT model definition
│  └─config.py  #Definition of various paths
│  └─dataloader.py    #For dataloader generation
│  └─predict.py    #For reasoning
│  └─tokenizer.py   #For morphological analysis
│  └─train.py       #For learning
├─vocab      #bert vocabulary dictionary vocab.txt
└─weights    # bert_config.json、pytorch_model.bin
└─Create_data_from_chABSA.ipynb   #tsv data creation
└─BERT model creation-learning~inference.ipynb   #Data loader creation~Learning~inference

The following two files are not stored in the git repository because of their size. Download the first from the Kyoto University website; create the second by training the model as described in the notebook and saving the model parameters yourself.

- pytorch_model.bin (pytorch-pretrained-BERT)
- bert_fine_tuning_chABSA_22epoch.pth (trained negative/positive parameter file)

3. BERT model schematic diagram of negative / positive classification

The following is a schematic diagram of the BERT model for negative/positive classification implemented this time.

The above BERT model is based on the source code of the book "**Learn while making! Deep learning by PyTorch**". This article does not explain the BERT model in detail, so please refer to the book if you are interested.

The source code itself is available at the link above.

In short, the BERT source code itself is based on huggingface/transformers, and a **fully connected layer (Linear)** for negative/positive classification is added at the end of the BERT model. The model outputs a two-class classification, **negative (0) or positive (1)**. For classification, it uses the features of the first token, **[CLS]**, of the input text.
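As a rough illustration of that last step, the classification head can be thought of as a single linear map from the [CLS] feature vector to two scores, followed by picking the larger one. The sketch below uses plain Python lists and invented 4-dimensional values; the real model works on 768-dimensional features (the "H-768" in the model name) with a PyTorch Linear layer, and the names `linear` and `classify` are hypothetical, not taken from the repository.

```python
from typing import List

def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def classify(cls_feature: List[float], weight: List[List[float]], bias: List[float]) -> int:
    """Return 0 (negative) or 1 (positive): the index of the larger score."""
    logits = linear(cls_feature, weight, bias)
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy values, invented for illustration only
cls_feature = [0.5, -1.0, 0.25, 2.0]
weight = [[0.1, 0.2, 0.0, -0.3],    # row producing the "negative" score
          [-0.1, 0.0, 0.4, 0.3]]    # row producing the "positive" score
bias = [0.0, 0.1]

print(linear(cls_feature, weight, bias))
print(classify(cls_feature, weight, bias))
```

Only the [CLS] position is fed to this head; the features of the other tokens are not used for the negative/positive decision.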

4. Creating a dataset for negative / positive learning

chABSA-dataset contains 230 JSON data files, but they cannot be used as-is as training data for a BERT negative/positive classifier.

Each JSON file stores multiple sentences and includes the following information.

item	meaning
sentence_id	ID that uniquely identifies the data
sentence	Sentence data
opinions	Contains multiple sets of {target, category, polarity, from, to}.
target	The keyword in the sentence that the opinion refers to
category	Industry information
polarity	Whether the target keyword is positive or negative
from, to	The character positions in sentence where the target keyword starts and ends

From these json files, create a tsv dataset that can be used for training, as follows. Each line consists of the input text followed by a tab and a label, 0 (negative) or 1 (positive).

On the other hand, the outlook remains uncertain due to risks such as the economic slowdown of the Chinese economy, policy management of the new US administration, and Brexit from the UK.
In the cosmetics and general merchandise business, we are strengthening store development by large stores and working to attract customers through digital sales promotion and increase customers by holding events, with sales of 3,262 million yen (15.5% decrease year-on-year)	0
In addition, maintenance contracts increased steadily, with sales of 6,952 million yen (1.2% increase year-on-year)	1
Regarding profits, segment profit (operating profit) was 1,687 million yen (2.4% increase compared to the same period of the previous year) due to an increase in replacement work and securing stable profits through maintenance contracts	1
In other segments, bicycle parking systems performed steadily, with sales of 721 million yen (0.8% increase year-on-year)	1

To create the data, run **Create_data_from_chABSA.ipynb** in Jupyter Notebook.

Following the steps creates training data (train.tsv) containing 1970 sentences and test data (test.tsv) containing 843 sentences.
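The exact labeling rule used in the notebook is not reproduced in this article. As a minimal sketch, one plausible rule (an assumption on my part) is to sum the polarities of a sentence's opinions (+1 for positive, -1 for negative, 0 for neutral), label the sentence 1 if the sum is positive and 0 if it is negative, and skip sentences whose sum is zero. The names `sentence_label` and `to_tsv_rows` and the sample data are hypothetical.

```python
import json
from typing import Iterator, Optional, Tuple

POLARITY_SCORE = {"positive": 1, "negative": -1, "neutral": 0}

def sentence_label(opinions: list) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None when the polarities cancel out."""
    score = sum(POLARITY_SCORE.get(op["polarity"], 0) for op in opinions)
    if score > 0:
        return 1
    if score < 0:
        return 0
    return None

def to_tsv_rows(document: dict) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs for every sentence that gets a label."""
    for item in document["sentences"]:
        label = sentence_label(item["opinions"])
        if label is not None:
            # Tabs or line breaks inside the text would break the TSV layout
            text = item["sentence"].replace("\t", " ").replace("\n", "")
            yield text, label

# Small stand-in document shaped like the chABSA sample shown earlier
sample = json.loads("""
{"sentences": [
  {"sentence_id": 0, "sentence": "Corporate performance improved.",
   "opinions": [{"target": "Corporate performance", "polarity": "positive"}]},
  {"sentence_id": 1, "sentence": "Real wages were sluggish.",
   "opinions": [{"target": "Real wages", "polarity": "negative"}]},
  {"sentence_id": 2, "sentence": "The fiscal year ended in March.", "opinions": []}
]}
""")

for text, label in to_tsv_rows(sample):
    print(f"{text}\t{label}")
```

Under this rule, a sentence with no opinions, or with equally many positive and negative ones, produces no row, which would be one way the roughly 2,800 labeled sentences end up being fewer than the total number of sentences in the JSON files.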

5. Implementation of Tokenizer for BERT

The BertTokenizer class for splitting input sentences into words is implemented in **utils\bert.py**. This time we use a Japanese dataset, so morphological analysis is done with Juman++, following the specifications of the [BERT Japanese Pretrained model](http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB).

Also, as described at the [link](http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB), the following points are customized for Japanese.

Set the **do_lower_case** option to **False** in the BertTokenizer class in bert.py.

class BertTokenizer(object):
    # Word-splitting (tokenization) class for BERT

    def __init__(self, vocab_file, do_lower_case=False):  # Changed to False (differs from the English model)
        ...  # remainder omitted in this excerpt

Comment out the following in the BasicTokenizer class of tokenizer.py

#text = self._tokenize_chinese_chars(text)  #Comment out because all kanji will be in one character unit

Add the **JumanTokenize class** for morphological analysis with Juman++ to tokenizer.py.

from pyknp import Juman

class JumanTokenize(object):
    """Runs JumanTokenizer."""
    
    def __init__(self):
        self.juman = Juman()

    def tokenize(self, text):
        result = self.juman.analysis(text)
        return [mrph.midasi for mrph in result.mrph_list()]

With the JumanTokenize class above, the input text is morphologically analyzed by Juman++ as follows.

cd chABSA-dataset
python
>>> from utils.tokenizer import JumanTokenize
>>> from pyknp import Juman
>>> text = "Ordinary income decreased by 818 million yen from the previous fiscal year, mainly due to a decrease in fund management income such as interest on loans2,It became 27,811 million yen"
>>> juman = JumanTokenize()
>>> print(juman.tokenize(text))
['Ordinary', 'Revenue', 'Is', '、', 'Lending', 'Money', 'Interest', 'Such', '資Money', 'Operation', 'Revenue', 'of', 'Decrease', 'To', 'Main cause', 'To', ' 、', 'Before', 'year', 'ratio', '818 million', 'Circle', 'Decrease', 'Shi', '2,27,811 million', 'Circle', 'When', 'Nari', 'まShiた']
>>>

6. Create DataLoader

Create a DataLoader with torchtext to generate data for training and testing. The DataLoader creation function **get_chABSA_DataLoaders_and_TEXT** is defined in **dataloader.py**, so use it.

Some argue that detailed preprocessing is better avoided when using BERT, but this time I added the following preprocessing steps.

- Half-width → full-width
- Delete line breaks, half-width spaces, and full-width spaces
- Unify all numeric characters to 0
- Replace symbols other than commas and periods with spaces
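The four steps above can be sketched with the standard library alone. This is only an illustration under my own assumptions: the order of the steps and the exact set of symbols are guesses, the half-width to full-width step is a simple code-point shift standing in for a library such as mojimoji (it does not handle half-width katakana), and the repository's implementation may order or scope these steps differently. The name `preprocess` is hypothetical.

```python
import re

# Half-width ASCII (0x21-0x7E) -> full-width forms, which sit 0xFEE0 higher.
HAN_TO_ZEN = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}

# Full-width symbols, excluding the comma (U+FF0C) and the period (U+FF0E)
SYMBOLS = re.compile(
    r"[\uFF01-\uFF0B\uFF0D\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF5E]")

def preprocess(text: str) -> str:
    text = text.translate(HAN_TO_ZEN)                  # 1. half-width -> full-width
    text = re.sub(r"[\r\n \u3000]", "", text)          # 2. drop line breaks and spaces
    text = re.sub(r"[0-9\uFF10-\uFF19]", "0", text)    # 3. every digit becomes "0"
    text = SYMBOLS.sub(" ", text)                      # 4. other symbols -> space
    return text

print(preprocess("売上高は1,234百万円\n(前年比5.6%増)"))
```

Replacing every digit with 0 discards the actual amounts, which fits the task: the classifier should react to words such as "increase" or "decrease", not to specific figures.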

The return values of the **get_chABSA_DataLoaders_and_TEXT** function are as follows.

item	meaning
train_dl	Training dataset
val_dl	Validation dataset
TEXT	torchtext.data.field.Field object
dataloaders_dict	Dictionary of iterators for training and validation data (*1)

*1: dataloaders_dict is used for training and validation.

If you are not sure how to use torchtext, please refer to the article "pytorch text preprocessing (torchtext) [for beginners]".

Below is the code that generates the Dataloader.

from utils.dataloader import get_chABSA_DataLoaders_and_TEXT
from utils.bert import BertTokenizer
train_dl, val_dl, TEXT, dataloaders_dict = get_chABSA_DataLoaders_and_TEXT(max_length=256, batch_size=32)

Let's take out the data from the generated training data (train_dl) and check the contents.

# Sanity check: take one mini-batch from the training DataLoader
batch = next(iter(train_dl))
print("Text shape=", batch.Text[0].shape)
print("Label shape=", batch.Label.shape)
print(batch.Text)
print(batch.Label)

As shown below, Text (the input data) holds a mini-batch of 32 sentences (the batch size), each with a maximum length of 256. The input is a list of numbers obtained by converting the word list into IDs. Label contains the correct label for each sentence: 0 (negative) or 1 (positive).

Text shape= torch.Size([32, 256])
Label shape= torch.Size([32])
(tensor([[    2,  3718,   534,  ...,     0,     0,     0],
        [    2, 17249,   442,  ...,     0,     0,     0],
        [    2,   719,  3700,  ...,     0,     0,     0],
        ...,
        [    2,   719,  3700,  ...,     0,     0,     0],
        [    2,    64,     6,  ...,     0,     0,     0],
        [    2,     1,  3962,  ...,     0,     0,     0]]), tensor([68, 48, 31, 30, 33, 89, 55, 49, 53, 29, 61, 44, 21, 69, 51, 48, 30, 32,
        54, 31, 39, 28, 27, 24, 24, 48, 21, 86, 39, 51, 71, 42]))
tensor([0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
        1, 1, 0, 1, 0, 1, 0, 0])

Just in case, take one sentence from the mini-batch and pass its list of IDs to the **convert_ids_to_tokens** method of **tokenizer_bert** to restore the original words.

# Check one sentence of the mini-batch (index 1)
tokenizer_bert = BertTokenizer(vocab_file="./vocab/vocab.txt", do_lower_case=False)
text_minibatch_1 = (batch.Text[0][1]).numpy()

#Return ID to word
text = tokenizer_bert.convert_ids_to_tokens(text_minibatch_1)

print(text)


['[CLS]', 'Sales', 'Profit', 'Is', '、', 'Complete', 'Construction', 'Total', 'Profit', 'rate', 'But', 'Improvement', 'did', 'thing', 'From', '、', 'Before', 'Linking', 'Accounting', 'year', 'ratio', '[UNK]', '．', '[UNK]', '％', 'Increase', 'of', '[UNK]', 'Circle', '（', 'Before', 'Linking', 'Accounting', 'year', 'Is', '[UNK]', 'Circle', '）', 'When', 'became', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ...]

The sentence starts with **[CLS]** and ends with **[SEP]**, unknown words become **[UNK]**, and the remainder up to 256 tokens is filled with **[PAD]**.
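To make the conversion concrete, here is a minimal sketch of how a token list could be wrapped, converted to IDs, and padded to a fixed length. It is not the torchtext implementation; `encode` and `toy_vocab` are hypothetical, and the IDs are invented except that [CLS]=2 and [PAD]=0 are chosen to match the tensor printed above.

```python
from typing import Dict, List

def encode(tokens: List[str], vocab: Dict[str, int], max_length: int) -> List[int]:
    """Wrap tokens in [CLS]/[SEP], map them to IDs, and pad to max_length."""
    body = tokens[: max_length - 2]            # leave room for [CLS] and [SEP]
    wrapped = ["[CLS]"] + body + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in wrapped]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))

# Toy vocabulary for illustration (the real vocab.txt is far larger)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "売上": 10, "は": 11, "増加": 12}

# "大幅" is not in the toy vocabulary, so it becomes [UNK]
print(encode(["売上", "は", "大幅", "増加"], toy_vocab, max_length=8))
# [2, 10, 11, 1, 12, 3, 0, 0]
```

With max_length=256, a sentence of 40 tokens therefore ends with more than 200 padding IDs, which is why the restored sentence above is mostly [PAD].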

So far, we have confirmed the creation of the dataset and the mini-batch that is actually generated.

7. Implementation of negative / positive classification model by BERT

Next, we will implement the negative-positive classification model by BERT.

The BERT model implemented this time is defined as the **BertModel class** in **utils\bert.py**, so this class is used to create the model.

Use the following files to build the model.

item	Description
bert_config.json	BERT model parameter file
pytorch_model.bin	Trained BERT model

First, create the base BERT model by passing the config to the **BertModel class**, then use the **set_learned_params** method defined in **bert.py** to load the parameters of the pre-trained BERT model (**pytorch_model.bin**). Then create the negative/positive classification model with the **BertForchABSA class** and put it into training mode with **net.train()**.

The code to generate the model is below.

from utils.bert import get_config, BertModel, BertForchABSA, set_learned_params

# Load the model settings JSON file as an object
config = get_config(file_path="./weights/bert_config.json")

# Create the base BERT model
net_bert = BertModel(config)

# Load the pre-trained parameters into the BERT model
net_bert = set_learned_params(
    net_bert, weights_path="./weights/pytorch_model.bin")

# Create the BERT negative/positive classification model
# (adds a fully connected (Linear) layer for classification at the end of the model)
net = BertForchABSA(net_bert)

# Switch to training mode
net.train()

8. BERT fine tuning settings

In the original BERT paper, all 12 BertLayer (Self-Attention) layers are fine-tuned, but this time only the last layer and the negative/positive classifier are trained.


# Compute gradients only for the last BertLayer module and the added classifier

# 1. Disable gradient computation for all parameters
for name, param in net.named_parameters():
    param.requires_grad = False

# 2. Enable gradient computation for the last BertLayer module only
for name, param in net.bert.encoder.layer[-1].named_parameters():
    param.requires_grad = True

# 3. Enable gradient computation for the classifier (negative or positive)
for name, param in net.cls.named_parameters():
    param.requires_grad = True

Next, specify the optimizer and loss function to use for training.

Both the final BertLayer (Self-Attention) layer and the classifier use the **torch.optim.Adam class**. The learning rate (lr) is **5e-5** and betas is the default value **(0.9, 0.999)** (the values from the reference book are used as they are).

Since this is a two-class classification of negative or positive, **torch.nn.CrossEntropyLoss** is specified as the criterion.


import torch.nn as nn
import torch.optim as optim

# Optimizer: fine-tune the last BertLayer of the original BERT and the classifier
optimizer = optim.Adam([
    {'params': net.bert.encoder.layer[-1].parameters(), 'lr': 5e-5},
    {'params': net.cls.parameters(), 'lr': 5e-5}
], betas=(0.9, 0.999))

# Loss function
criterion = nn.CrossEntropyLoss()
# Equivalent to applying nn.LogSoftmax() and then nn.NLLLoss (negative log likelihood loss)
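To show what this loss computes for a single sample, here is a small pure-Python sketch of "log-softmax followed by negative log likelihood". The function names are hypothetical and the logits are invented; the PyTorch class additionally averages the result over the mini-batch by default.

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log likelihood of the correct class."""
    return -log_softmax(logits)[label]

print(cross_entropy([0.0, 0.0], 1))    # undecided model: ln 2, about 0.693
print(cross_entropy([-2.0, 3.0], 1))   # confident and correct: small loss
print(cross_entropy([-2.0, 3.0], 0))   # confident and wrong: large loss
```

A two-class model that has not learned anything yet gives both classes about the same score, so its loss is close to ln 2 (about 0.69). That is a useful reference point when reading the loss values printed at the start of training.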

9. BERT training / inference

Next, we carry out training and validation, using the train_model function for training and validation defined in utils\train.py. The **train.tsv (1970)** data is used for training and the **test.tsv (843)** data for validation.

Training takes a long time in a CPU environment, so a GPU environment such as Google Colaboratory is recommended.

When I tried it in a CPU environment with Core i7 8 cores and 16GB memory, it took about 30 minutes per epoch.


import torch
from utils.train import train_model

# Run training and validation
num_epochs = 1   # Change the number of epochs as needed
net_trained = train_model(net, dataloaders_dict,
                          criterion, optimizer, num_epochs=num_epochs)

# Save the trained network parameters
# (the file name assumes the result after 22 epochs is being saved)
save_path = './weights/bert_fine_tuning_chABSA_22epoch.pth'
torch.save(net_trained.state_dict(), save_path)

The arguments of **train_model** are as follows.

item	Description
net	BERT negative / positive classification model
dataloaders_dict	Iterator for learning & verification
criterion	Loss function
optimizer	Optimizer
num_epochs	Number of epochs

When executed, the accuracy for every 10 iterations and the Loss and Acc for each epoch are displayed as shown below.

Device used: cpu
-----start-------
Iteration 10|| Loss: 0.6958 || 10iter: 294.9368 sec. ||Correct answer rate of this iteration: 0.46875
Iteration 20|| Loss: 0.7392 || 10iter: 288.1598 sec. ||Correct answer rate of this iteration: 0.4375
Iteration 30|| Loss: 0.6995 || 10iter: 232.9404 sec. ||Correct answer rate of this iteration: 0.53125
Iteration 40|| Loss: 0.5975 || 10iter: 244.0613 sec. ||Correct answer rate of this iteration: 0.6875
Iteration 50|| Loss: 0.5678 || 10iter: 243.3908 sec. ||Correct answer rate of this iteration: 0.71875
Iteration 60|| Loss: 0.5512 || 10iter: 269.5538 sec. ||Correct answer rate of this iteration: 0.6875
Epoch 1/1 | train |  Loss: 0.6560 Acc: 0.5975
Epoch 1/1 |  val  |  Loss: 0.5591 Acc: 0.7711

10. Learning results

This time, I set the maximum number of epochs to 50 and compared accuracy across the following three patterns.

- Fine-tune **only the final layer** of BertLayer (Self-Attention)
- Fine-tune **only the last two layers** of BertLayer (Self-Attention)
- Fine-tune the **last six layers** of BertLayer (Self-Attention)
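The three patterns differ only in which parameters are unfrozen. As a sketch of that selection logic, the helper below decides from a parameter name whether it would be trained when the last N layers are fine-tuned. The prefixes `bert.encoder.layer.<index>.` and `cls.` follow from the `net.bert.encoder.layer[-1]` and `net.cls` attributes used in section 8; the rest of each example name and the helper `is_trainable` itself are hypothetical.

```python
def is_trainable(param_name: str, num_layers: int = 12, last_n: int = 1) -> bool:
    """Decide from a parameter name whether it would be fine-tuned."""
    if param_name.startswith("cls."):
        return True                       # the added classifier is always trained
    prefix = "bert.encoder.layer."
    if param_name.startswith(prefix):
        layer_index = int(param_name[len(prefix):].split(".")[0])
        return layer_index >= num_layers - last_n
    return False                          # embeddings and earlier layers stay frozen

# Illustrative parameter names (suffixes invented for this sketch)
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.5.output.dense.weight",
    "bert.encoder.layer.6.output.dense.weight",
    "bert.encoder.layer.10.output.dense.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "cls.weight",
]

for n in (1, 2, 6):
    print(n, [name for name in names if is_trainable(name, last_n=n)])
```

In a PyTorch loop over `net.named_parameters()`, the result of such a helper could be assigned to `param.requires_grad`, which would replace the three separate loops shown in section 8.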

The results are as follows.

The following is a summary of the evaluation.

- In this model, increasing the number of fine-tuned layers had almost no effect on accuracy.
- Accuracy reaches about 86% after roughly 5 epochs, and further epochs do not improve it significantly.
- Beyond 20 epochs, overfitting becomes noticeable.

In the end, the highest accuracy (**87.76%**) was obtained when only the final BertLayer was fine-tuned for **22 epochs**.

11. Input test text to visualize predictions and Attention

Using the learned BERT negative-positive classification model, give a sample sentence to visualize the negative-positive predicted value and Attention (which word was emphasized in the judgment?).

Attention is displayed in html format, so it is easy to understand if you use Jupyter Notebook.

Advance preparation

In order to use the TEXT object (torchtext.data.field.Field) generated by torchtext at the time of inference, dump the TEXT object to the pkl file once.


from utils.predict import create_vocab_text
TEXT = create_vocab_text()

When the above code is executed, text.pkl is generated under \chABSA-dataset\data.

The create_vocab_text method is defined in predict.py. It uses the dummy data (train_dumy.tsv, test_dumy.tsv) under \chABSA-dataset\data and the BERT vocabulary data (vocab.txt) to build a TEXT object, then writes it out with pickle.

def create_vocab_text():
    TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer_with_preprocessing, use_vocab=True,
                            lower=False, include_lengths=True, batch_first=True, fix_length=max_length, init_token="[CLS]", eos_token="[SEP]", pad_token='[PAD]', unk_token='[UNK]')
    LABEL = torchtext.data.Field(sequential=False, use_vocab=False)
    train_val_ds, test_ds = torchtext.data.TabularDataset.splits(
        path=DATA_PATH, train='train_dumy.tsv',
        test='test_dumy.tsv', format='tsv',
        fields=[('Text', TEXT), ('Label', LABEL)])
    vocab_bert, ids_to_tokens_bert = load_vocab(vocab_file=VOCAB_FILE)
    TEXT.build_vocab(train_val_ds, min_freq=1)
    TEXT.vocab.stoi = vocab_bert
    pickle_dump(TEXT, PKL_FILE)

    return TEXT
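The bodies of the `pickle_dump` and `pickle_load` helpers used in the article's code are not shown here. A minimal version could look like the following; this is an assumed implementation, not the repository's code, and the stand-in dictionary only mimics a small vocabulary instead of a real TEXT object.

```python
import os
import pickle
import tempfile

def pickle_dump(obj, path: str) -> None:
    """Serialize obj to a file at path."""
    with open(path, mode="wb") as f:
        pickle.dump(obj, f)

def pickle_load(path: str):
    """Load an object previously written by pickle_dump."""
    with open(path, mode="rb") as f:
        return pickle.load(f)

# Round-trip check with a small stand-in object
stand_in = {"stoi": {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2}}
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "text.pkl")
    pickle_dump(stand_in, pkl_path)
    restored = pickle_load(pkl_path)

print(restored == stand_in)
```

Dumping the TEXT object once means inference only has to load a small file, instead of rebuilding the field from the tsv files every time.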

Inference and Attention Visualization Execution

The methods for building the trained model (**build_bert_model**) and for inference (**predict**) are defined in **utils\predict.py**; use them to feed in sample text and visualize the predicted value and Attention. The Attention is rendered as HTML using IPython.

from utils.config import *
from utils.predict import predict, build_bert_model
from IPython.display import HTML, display


input_text = "As a result of the above, sales in the current consolidated fiscal year 1,785 million yen(357 million yen decrease from the same period of the previous year, 16.7% decrease), Operating loss 117 million yen(174 million yen decrease year-on-year, operating income 57 million yen year-on-year), Ordinary loss 112 million yen(183 million yen decrease year-on-year, ordinary income 71 million yen year-on-year), Net loss attributable to owners of parent 58 million yen(Decrease of 116 million yen year-on-year, net income attributable to owners of parent of 57 million yen year-on-year)have become"
net_trained = build_bert_model()
html_output = predict(input_text, net_trained)
print("======================Display of inference results======================")
print(input_text)
display(HTML(html_output))

When I run the above code, I get the following result.

Unknown words are displayed as [UNK]

12. Display inference results and confusion matrices with a large amount of test data

Next, run inference automatically on a large amount of test data and display the **confusion matrix** to evaluate the results.

First, import the required modules.

from utils.config import *
from utils.predict import predict2, create_vocab_text, build_bert_model
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

The confusion matrix is displayed using **sklearn**. A method predict2, which returns only the predicted values (preds), has also been added to utils\predict.py.

def predict2(input_text, net_trained):
    TEXT = pickle_load(PKL_FILE)   # Load the vocab data
    input = conver_to_model_format(input_text, TEXT)
    input_pad = 1  # Word ID of '<pad>' is 1
    input_mask = (input != input_pad)  # Note: computed but not passed to the model below
    outputs, attention_probs = net_trained(input, token_type_ids=None, attention_mask=None,
                                       output_all_encoded_layers=False, attention_show_flg=True)
    _, preds = torch.max(outputs, 1)  # Predicted label
    # html_output = mk_html(input, preds, attention_probs, TEXT)  # HTML creation (not needed here)
    return preds

The input data is the **test.csv file**, prepared as follows.

Next, read test.csv with pandas, feed the sentences in the **INPUT** column to the trained BERT model one by one for negative/positive judgment, and store each prediction in the **PREDICT** column. After processing all rows, save the result as **predicted_test.csv**.

df = pd.read_csv("test.csv", engine="python", encoding="utf-8-sig")
net_trained.eval()  # Switch to inference mode

for index, row in df.iterrows():
    # In a GPU environment, use .cpu().numpy() instead of .numpy()
    df.at[index, "PREDICT"] = predict2(row['INPUT'], net_trained).numpy()[0]

df.to_csv("predicted_test .csv", encoding="utf-8-sig", index=False)

predicted_test.csv, with the prediction results added as shown below, is generated.

Finally, the confusion matrix is displayed from the contents of this csv file.

# Display (evaluate) the confusion matrix

y_true =[]
y_pred =[]
df = pd.read_csv("predicted_test .csv", engine="python", encoding="utf-8-sig")
for index, row in df.iterrows():
    if row['LABEL'] == 0:
        y_true.append("negative")
    if row['LABEL'] ==1:
        y_true.append("positive")
    if row['PREDICT'] ==0:
        y_pred.append("negative")
    if row['PREDICT'] ==1:
        y_pred.append("positive")

    
print(len(y_true))
print(len(y_pred))


# Compute the confusion matrix
labels = ["negative", "positive"]
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)

#Convert to data frame
cm_labeled = pd.DataFrame(cm, columns=labels, index=labels)

#View results
cm_labeled

A confusion matrix like the following is displayed. (Figure: confusion matrix)

In this table, the negative and positive labels on the left (rows) are the labels of the actual data, and those along the top (columns) are the predicted values. For example, the number "**62**" is the number of negative samples that were mistakenly predicted to be positive.
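To make the orientation concrete, the following pure-Python sketch tallies a confusion matrix with the same layout (rows are actual labels, columns are predicted labels). The function `confusion_counts` and the five sample labels are invented for illustration and are unrelated to the article's results.

```python
from typing import List, Sequence

def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: List[str]) -> List[List[int]]:
    """Count samples per (actual, predicted) pair: rows actual, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

labels = ["negative", "positive"]
y_true = ["negative", "negative", "negative", "positive", "positive"]
y_pred = ["negative", "positive", "negative", "positive", "negative"]

cm = confusion_counts(y_true, y_pred, labels)
print(cm)         # [[2, 1], [1, 1]]
print(cm[0][1])   # negatives mistakenly predicted as positive: 1
```

The cell in the first row and second column is therefore the count of actual negatives predicted as positive, which is the position of the value 62 in the article's matrix.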

Next, the accuracy, precision, recall, and F1 score are displayed with the following code.

y_true = []
y_pred = []
df = pd.read_csv("predicted_test .csv", engine="python", encoding="utf-8-sig")
for index, row in df.iterrows():
    y_true.append(row["LABEL"])
    y_pred.append(row["PREDICT"])

print("Accuracy (ratio of correct answers out of all samples)={}%".format((round(accuracy_score(y_true, y_pred),2)) *100 ))
print("Precision (probability of being actually positive among those predicted to be positive)={}%".format((round(precision_score(y_true, y_pred),2)) *100 ))
print("Recall (probability predicted to be positive for positive data)={}%".format((round(recall_score(y_true, y_pred),2)) *100 ))
print("F1 (harmonic mean of precision and recall)={}%".format((round(f1_score(y_true, y_pred),2)) *100 ))

# Execution result

Accuracy (ratio of correct answers out of all samples)=76.0%
Precision (probability of being actually positive among those predicted to be positive)=85.0%
Recall (probability predicted to be positive for positive data)=71.0%
F1 (harmonic mean of precision and recall)=78.0%
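For reference, these four metrics can be written out directly in terms of the counts of true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn), with "positive" as the target class. The sketch below uses hypothetical counts, not the article's results, and the function `binary_metrics` is my own helper; it does not guard against division by zero.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 with 'positive' as the target class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)      # of those predicted positive, how many are positive
    recall = tp / (tp + fn)         # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

A precision noticeably higher than recall, as in the execution result above, means the model rarely labels a negative sentence as positive but misses a fair share of the positive ones.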

13. Summary

This time, I added a fully connected (Linear) layer on top of the BERT model for binary negative/positive classification, but the same approach should be applicable to other tasks such as multi-class classification and QA, which I would like to try in the future.

**Added 2019/12/25**: If you are interested in serving the BERT negative/positive classifier created in this article through the Django REST framework, see the **"Django Advent Calendar 2019 - Qiita" day 20 article (/items/30e10a3db76c6f7c5b4d)**.
