[PYTHON] Easily build a natural language processing model with BERT + LightGBM + optuna

This article walks through building a natural-language classification model with BERT, LightGBM, and optuna. The data is the livedoor news corpus.

All the code used in this article is available here: https://github.com/kazuki-hayakawa/bert_lightgbm_model

If you want to run it yourself, git clone the repository and try it on your machine.

Overall flow

  1. Download livedoor news corpus data
  2. Download a pretrained Japanese BERT model
  3. Start the experimental container
  4. Feature generation
  5. Model training
  6. Evaluation of the model with test data
  7. Shut down the container

Download livedoor news corpus data

Download the livedoor news corpus data to the data/raw directory. The download script is provided in src/data/download_livedoor_news.sh.
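
For reference, a rough Python equivalent of that shell script might look like the sketch below. The archive URL here is the commonly distributed ldcc-20140209.tar.gz and is an assumption on my part; check src/data/download_livedoor_news.sh in the repository for the actual source.

import os
import tarfile
import urllib.request

# Rough sketch of the download step (the URL is assumed; see
# src/data/download_livedoor_news.sh in the repository for the real one).
CORPUS_URL = 'https://www.rondhuit.com/download/ldcc-20140209.tar.gz'
RAW_DIR = '../../data/raw'

os.makedirs(RAW_DIR, exist_ok=True)
archive_path = os.path.join(RAW_DIR, 'ldcc-20140209.tar.gz')
urllib.request.urlretrieve(CORPUS_URL, archive_path)

# Extracts to data/raw/text/<media_name>/*.txt, which preprocess.py expects
with tarfile.open(archive_path, 'r:gz') as tar:
    tar.extractall(RAW_DIR)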

Then run src/data/preprocess.py to preprocess the data and save it split into training and test sets.

src/data/preprocess.py


import os
import glob
from tqdm import tqdm
import pandas as pd
from sklearn.model_selection import train_test_split


def read_text(text_filepath):
    """Read only the text from the 4th line onward according to the format of livedoor news"""
    with open(text_filepath, 'r') as f:
        lines = f.readlines()
        lines = lines[3:]

    text = ' '.join(lines)
    # Remove full-width spaces and newlines
    text = text.replace('\u3000', '').replace('\n', '')
    return text


def main():
    # Run download_livedoor_news.sh in advance to fetch the data
    exclude_files = ['CHANGES.txt', 'README.txt', 'LICENSE.txt']
    all_file_paths = glob.glob('../../data/raw/text/**/*.txt', recursive=True)
    all_file_paths = [p for p in all_file_paths
                      if os.path.basename(p) not in exclude_files]

    df_processed = pd.DataFrame(columns=['id', 'media', 'text'])
    for idx, filepath in enumerate(tqdm(all_file_paths)):
        media = os.path.dirname(filepath).replace('../../data/raw/text/', '')
        text = read_text(filepath)
        row = pd.Series([idx + 1, media, text], index=df_processed.columns)
        df_processed = df_processed.append(row, ignore_index=True)

    df_train, df_test, _, _ = train_test_split(
        df_processed, df_processed['media'], test_size=0.1, random_state=0,
        stratify=df_processed['media']
    )
    df_train.to_csv('../../data/processed/train_dataset.csv', index=False)
    df_test.to_csv('../../data/processed/test_dataset.csv', index=False)


if __name__ == '__main__':
    main()

Download a pretrained Japanese BERT model

For the procedure, refer to the article "Creating a Japanese BERT sentence embedding calculation server using bert-as-service".

Create the models/bert_jp directory and download the pretrained Japanese BERT model into it.

Rename the files so that they can be loaded by bert-as-service:

mv model.ckpt-1400000.index bert_model.ckpt.index
mv model.ckpt-1400000.meta bert_model.ckpt.meta 
mv model.ckpt-1400000.data-00000-of-00001 bert_model.ckpt.data-00000-of-00001

Creating a vocabulary file

cut -f1 wiki-ja.vocab | sed -e "1 s/<unk>/[UNK]/g" > vocab.txt

Creating a BERT configuration file

models/bert_jp/bert_config.json


{
    "attention_probs_dropout_prob" : 0.1,
    "hidden_act" : "gelu",
    "hidden_dropout_prob" : 0.1,
    "hidden_size" : 768,
    "initializer_range" : 0.02,
    "intermediate_size" : 3072,
    "max_position_embeddings" : 512,
    "num_attention_heads" : 12,
    "num_hidden_layers" : 12,
    "type_vocab_size" : 2,
    "vocab_size" : 32000
}

Launching the experimental container

Start the container by running docker-compose up -d (refer to the GitHub repository for the Dockerfile and docker-compose.yml). Then run docker-compose exec analytics /bin/bash to enter the container.

Feature generation

The Bert class, which wraps the BERT client, is implemented as follows.

src/features/bert.py


import sentencepiece as spm
from bert_serving.client import BertClient


class Bert():
    """ Bert model client
        Before usage, you need to run bert server.
    """

    def __init__(self, bert_model_path, client_ip='0.0.0.0'):
        self.bert_client = BertClient(ip=client_ip)
        self.spm_model = spm.SentencePieceProcessor()
        self.spm_model.load(bert_model_path + 'wiki-ja.model')

    def _parse(self, text):
        text = str(text).lower()
        encoded_texts = self.spm_model.EncodeAsPieces(text)
        encoded_texts = [t for t in encoded_texts if t.strip()]
        return encoded_texts

    def text2vec(self, texts):
        """

        Args:
            texts (list):List of Japanese strings

        Returns:
            numpy array:Distributed representation tensor of text
        """

        parsed_texts = list(map(self._parse, texts))
        tensor = self.bert_client.encode(parsed_texts, is_tokenized=True)
        return tensor
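
As a quick sanity check, the class can also be used on its own. Below is a minimal usage sketch; it assumes the bert-serving server is already running, and the example sentences are arbitrary. The 768-dimensional output matches hidden_size in bert_config.json.

from bert import Bert

# Minimal usage sketch: assumes the bert-serving server is already running
# and reachable at the default client IP used in this project.
bert = Bert(bert_model_path='../../models/bert_jp/', client_ip='0.0.0.0')
vectors = bert.text2vec(['今日はいい天気です。', '新しいスマートフォンが発表された。'])
print(vectors.shape)  # e.g. (2, 768): one 768-dimensional vector per input text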

Use this class to convert the text into vectors. At the same time, the media name, which is the target variable, is converted to an integer label. Run src/features/build_features.py.

src/features/build_features.py


import subprocess
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from bert import Bert


def build_features(df, bert_client):
    vectors = bert_client.text2vec(df['text'])
    le = LabelEncoder()
    targets = le.fit_transform(df['media'])
    return vectors, targets


def main():
    BERT_MODEL_PATH = '../../models/bert_jp/'

    # start bert server
    commands = ['bert-serving-start', '-model_dir',
                BERT_MODEL_PATH, '-num_worker=1', '-cpu']
    p = subprocess.Popen(commands, shell=False,
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    # start bert client
    bert = Bert(bert_model_path=BERT_MODEL_PATH, client_ip='0.0.0.0')

    # build train features
    train_dataset = pd.read_csv('../../data/processed/train_dataset.csv')
    train_vectors, train_targets = build_features(train_dataset, bert)
    np.save('../../data/features/train_vectors', train_vectors)
    np.save('../../data/features/train_targets', train_targets)

    # build test features
    test_dataset = pd.read_csv('../../data/processed/test_dataset.csv')
    test_vectors, test_targets = build_features(test_dataset, bert)
    np.save('../../data/features/test_vectors', test_vectors)
    np.save('../../data/features/test_targets', test_targets)

    p.terminate()


if __name__ == '__main__':
    main()

Model training

The class MediaClassifier, which defines a model for classifying news media, is implemented as follows.

src/models/classifier.py


import os
import uuid
import pickle
import numpy as np
import lightgbm as lgb
import optuna
from datetime import datetime, timedelta, timezone
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class MediaClassifier():
    """model for multi-class classification of livedoor news corpus"""

    def __init__(self, output_dir, use_gpu=False):
        JST = timezone(timedelta(hours=+9), 'JST')
        dt_now = datetime.now(JST)
        training_date = dt_now.strftime("%Y%m%d_%H%M%S")
        self.output_dir = os.path.join(output_dir, training_date)
        os.makedirs(self.output_dir, exist_ok=True)
        
        self.device = 'gpu' if use_gpu else 'cpu'

    def train(self, features, targets):
        X_train, X_test, y_train, y_test = train_test_split(
            features, targets, test_size=0.2, random_state=0)

        def objectives(trial):
            trial_uuid = str(uuid.uuid4())
            trial.set_user_attr("trial_uuid", trial_uuid)

            # Parameter and callback settings
            params = {
                # The livedoor news corpus has 9 media outlets, so this is a 9-class classification
                'objective': 'multiclass',
                'num_class': 9,
                'metric': 'multi_logloss',
                'num_leaves': trial.suggest_int("num_leaves", 10, 500),
                'feature_fraction': trial.suggest_uniform("feature_fraction", 0.0, 1.0),
                'class_weight': 'balanced',
                'device': self.device,
                'verbose': -1
            }

            pruning_callback = optuna.integration.LightGBMPruningCallback(
                trial, "multi_logloss")

            # training
            lgb_model = lgb.train(params, lgb.Dataset(X_train, y_train), num_boost_round=100,
                                  valid_sets=lgb.Dataset(X_test, y_test), callbacks=[pruning_callback])

            y_pred_train = np.argmax(lgb_model.predict(X_train), axis=1)
            y_pred_test = np.argmax(lgb_model.predict(X_test), axis=1)
            accuracy_train = accuracy_score(y_train, y_pred_train)
            accuracy_test = accuracy_score(y_test, y_pred_test)

            trial.set_user_attr("accuracy_train", accuracy_train)
            trial.set_user_attr("accuracy_test", accuracy_test)

            # Save the model
            output_file = os.path.join(self.output_dir, f"{trial_uuid}.pkl")
            with open(output_file, "wb") as fp:
                pickle.dump(lgb_model, fp)

            return 1.0 - accuracy_test

        study = optuna.create_study()
        study.optimize(objectives, n_trials=100)

        result_df = study.trials_dataframe()
        result_csv = os.path.join(self.output_dir, "result.csv")
        result_df.to_csv(result_csv, index=False)

        return study.best_trial.user_attrs

Now train the model defined above.

src/models/train_model.py


import numpy as np
from classifier import MediaClassifier


def main():
    train_vectors = np.load('../../data/features/train_vectors.npy')
    train_targets = np.load('../../data/features/train_targets.npy')

    model = MediaClassifier(output_dir='../../models/training_models',
                            use_gpu=False)

    best_result = model.train(train_vectors, train_targets)
    print('best result \n', best_result)


if __name__ == '__main__':
    main()

When execution finishes, the trial uuid and scores of the best-performing model are printed as follows, so make a note of them.

{'trial_uuid': 'BEST_MODEL_TRIAL_UUID', 'accuracy_train': 1.0, 'accuracy_test': 0.7398190045248869}

The BEST_MODEL_TRIAL_UUID part will be an actual uuid.
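
If you lose the printed output, the per-trial results are also saved to result.csv in the training output directory, so the best trial can be looked up there. Below is a minimal sketch; TRAINING_DATE stands in for the auto-generated directory name, and the exact column names depend on your optuna version.

import pandas as pd

# Minimal sketch: recover the best trial from result.csv.
# The objective is 1.0 - accuracy_test, so the smallest 'value' is the best trial.
# Assumes flat column names ('value', 'user_attrs_...'); adjust for your optuna version.
df = pd.read_csv('../../models/training_models/TRAINING_DATE/result.csv')
best = df.sort_values('value').iloc[0]
print(best)  # includes the trial uuid and accuracy user attributes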

Evaluation of the model with test data

Evaluation using test data is performed as follows.

src/models/predict_model.py


import argparse
import pickle
import numpy as np
from sklearn.metrics import accuracy_score


def main(args):
    test_vectors = np.load('../../data/features/test_vectors.npy')
    test_targets = np.load('../../data/features/test_targets.npy')

    with open(args.best_model, 'rb') as f:
        model = pickle.load(f)

    pred_targets = np.argmax(model.predict(test_vectors), axis=1)
    accuracy = accuracy_score(test_targets, pred_targets)
    print('test accuracy : {:.2f}'.format(accuracy))

    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--best_model', help='best model pickle file path.')

    args = parser.parse_args()

    main(args)

A timestamped directory is automatically created when you start training the model, so specify the path to the saved model inside it. The following is an execution example.

$ cd src/models
$ python predict_model.py --best_model='../../models/training_models/TRAINING_DATE/BEST_MODEL_TRIAL_UUID.pkl'

test accuracy : 0.73

The accuracy of the model I actually built was 0.73. That is not especially high, but I think it could be raised a bit further by increasing the amount of data and improving how the text is tokenized.

Shut down the container

Exit the container with exit, then shut it down with docker-compose down.
