[Python] Classify natural language text right away [simpletransformers, Transformer]

I want to classify natural language (text) with as few lines of code as possible.

This post introduces a way to train and run inference without any preprocessing or word vectorization, so accuracy is often not that high.

simpletransformers is a Python package that makes it easy to do various natural language tasks.

Installation

This assumes that basic packages such as pandas are already installed; those can be installed with either conda or pip.

Install pytorch

If you already have PyTorch installed, skip this step. For GPU:

conda install pytorch>=1.6 cudatoolkit=10.2 -c pytorch

CPU only:

conda install pytorch cpuonly -c pytorch

Install simpletransformers

I don't think simpletransformers can be installed with conda, so use pip.

pip install simpletransformers

Install wandb (optional)

wandb is used to visualize training in the browser. Everything works without it, so skip this if you do not need it; how to use it is not covered here.

pip install wandb

How to use

The following assumes that a pandas DataFrame contains the text and the corresponding labels.

Rename the column containing the text to "text" and the column containing the labels to "label".
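For concreteness, here is a minimal sketch of such a DataFrame; the sentences and label values are made up purely for illustration.

import pandas as pd

# Hypothetical example data: a "text" column and an integer "label" column starting from 0
train_df = pd.DataFrame({
    "text": [
        "The match ended in a dramatic penalty shootout.",
        "The central bank raised interest rates again.",
        "A new smartphone model was announced today.",
    ],
    "label": [0, 1, 2],  # e.g. 0 = sports, 1 = economy, 2 = technology
})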

Parameter settings

params = {
    "output_dir": "out_models/bert_model/",
    "max_seq_length": 256,
    "train_batch_size": 128,
    "eval_batch_size": 128,
    "num_train_epochs": 10,
    "learning_rate": 1e-4,
}

Specify any directory for output_dir; it is where the model created during training will be saved.
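As a side note, once training has written files to this directory, the same directory can be passed to ClassificationModel in place of a pre-trained model name to reload the saved model later. A minimal sketch, assuming training has already finished:

from simpletransformers.classification import ClassificationModel

# Reload the model saved under output_dir (only works after training has run)
model = ClassificationModel("bert", "out_models/bert_model/", args=params, use_cuda=True)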

Training code

from simpletransformers.classification import ClassificationModel

model = ClassificationModel("bert", "bert-base-cased", num_labels=10, args=params, use_cuda=True)

model.train_model(train_df)

Replace "bert" with the type of model you want to use and "bert-base-cased" with the name of the pre-trained model you want to use. See here for the available pre-trained models and their names. num_labels is the number of labels; format the labels so that they start from 0. train_df is the DataFrame with the "text" and "label" columns described above.
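If your labels are strings or do not start from 0, one simple way to convert them is pandas' factorize. A sketch, assuming a hypothetical "category" column holding the original string labels:

# Hypothetical: map a string "category" column to integer labels starting from 0
train_df["label"], label_names = pd.factorize(train_df["category"])
num_labels = len(label_names)  # pass this value as num_labels to ClassificationModel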

Inference method

This assumes a test DataFrame with the same format as the training DataFrame.

pred, _ = model.predict(test_df['text'].to_list())

The predicted labels are returned in pred (predict expects a list of strings, hence the conversion above).
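To get a quick accuracy figure, you could compare pred with the true labels, for example with scikit-learn (assuming it is installed); simpletransformers also provides an eval_model method for more detailed evaluation.

from sklearn.metrics import accuracy_score

# Compare predicted labels with the true labels in the test DataFrame
print(accuracy_score(test_df['label'], pred))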

Summary

The simpletransformers documentation is here (https://simpletransformers.ai/). It has many other features, so please try it out. Keep in mind that preprocessing, post-processing, and so on are important; since they are ignored here, you cannot expect much accuracy.
