[PYTHON] Separation of Japanese surname and given name with BERT

Depending on the DB, the surname and given name are stored together, and there may be a desire to mechanically separate them. It's surprisingly difficult to do this even with a complete list of surnames. BERT is a hot topic right now, but I would like to introduce it because I was able to separate surnames and given names with high accuracy by learning personal names.

Result

The source code is quite long, so I will start with the results. The model separated the 1,200 verification records with an accuracy of **99.0%**. The verification data and part of the predictions (surname only) look like this.

last_name first_name full_name pred
89 Shape Mayu Mayu Katabe Shape
1114 Kumazoe Norio Norio Kumazoe Kumazoe
1068 Kimoto Souho Kimoto Soho Kimoto
55 Yashiki Takajin Hiroki Yashiki Takajin Yashiki Takajin
44 Basic Shodai Kishodai Basic

The 12 failed cases are as follows. The first one is the kind of name that even a human would likely split the same (wrong) way the model did.

last_name first_name full_name pred
11 Toshi Saburi Win Toshikatsu Sabane Sabane
341 Brush Kasumi Brush Kasumi Brush flower
345 Shinto Shinichi Shinichi Shinto Makoto Shinto
430 Chestnut Kanae Kanae Kuri Kurika
587 Keisuke Nina Kei Ryojina Kei
785 Bansho Good Bansho Turn
786 Yutaka Wakana Kana Toyowa Toyokazu
995 Seri Yu Seriyoshi Se
1061 So Real princess Somihime Somi
1062 Instep fruit Kogi Nomi Koki
1155 Hotaka Natsuho Hotaka Natsuho Hotakaho
1190 Extremely average dream Extreme dream very

For comparison, a simple surname/given-name split using janome (a morphological analyzer implemented purely in Python, with performance comparable to MeCab) and only its preset dictionary (ipadic), as shown below, achieved an accuracy of 34.5%.

from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()  # preset ipadic dictionary only

def extract_last_name(sentence):
    # Return the first token whose part of speech marks it as a surname (名詞,固有名詞,人名,姓).
    for token in tokenizer.tokenize(sentence):
        if '姓' in token.part_of_speech:
            return token.surface

df['pred'] = df['full_name'].apply(lambda full_name: extract_last_name(full_name))

Next, I obtained a surname dictionary from the surname database, fed it to janome as a user dictionary as shown below, and ran the analysis again. The accuracy improved to 79.7%.

tokenizer2 = Tokenizer('last_name_dic.csv', udic_enc="utf8")

def extract_last_name2(sentence):
    # If the first token is recognized as a surname (名詞,固有名詞,人名,姓), return it.
    token_arr = [token for token in tokenizer2.tokenize(sentence)]
    if '姓' in token_arr[0].part_of_speech:
        return token_arr[0].surface

df['pred2'] = df['full_name'].apply(lambda full_name: extract_last_name2(full_name))
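
The article does not show how last_name_dic.csv itself was built. As a rough sketch, assuming a plain text file with one surname per line (the file name, context IDs, and cost below are placeholder assumptions), a MeCab/IPADIC-format user dictionary could be generated like this:

import csv

# Hypothetical input: one surname per line (e.g. 佐藤, 鈴木, ...).
with open('last_name_list.txt', encoding='utf8') as f:
    last_names = [line.strip() for line in f if line.strip()]

# MeCab/IPADIC user-dictionary row:
# surface, left_id, right_id, cost, POS, POS detail 1-3, conj. type, conj. form, base form, reading, pronunciation
with open('last_name_dic.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    for name in last_names:
        # 1290/1290/5000 are rough placeholder context IDs and cost; tune them if segmentation looks off.
        writer.writerow([name, 1290, 1290, 5000,
                         '名詞', '固有名詞', '人名', '姓', '*', '*', name, '*', '*'])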

Adding a given-name dictionary as well would probably raise the accuracy further, but such a list seemed quite hard to obtain and process, so I will verify that if I ever find the time. (I probably won't.)

Source code

First, import what you need.

import pandas as pd
import numpy as np
from transformers import BertConfig, BertTokenizer, BertJapaneseTokenizer, BertForTokenClassification
from keras.preprocessing.sequence import pad_sequences
import torch
import MeCab
import math
import time

For the pretrained model, most examples seem to use bert-base-japanese-whole-word-masking, but unexpected word segmentation at tokenization time would be a problem here, so this time I use the char model, which splits the text one character at a time.

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')
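
As a quick check of what the char tokenizer does (the name 田中太郎 is just an example):

print(tokenizer.tokenize('田中太郎'))
# -> ['田', '中', '太', '郎']
print(tokenizer.encode('田中太郎'))
# -> the same characters as ids, wrapped in the [CLS] and [SEP] ids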

I used the Amazing Name Generator as the source of the training data. It is an interesting site in itself, since the rarity of each name is quantified. This time the data contained 48,000 names covering roughly 22,000 distinct surnames, so it is fairly widely distributed and includes rare surnames. The CSV has just three columns: full_name, last_name, and first_name. First, tokenize the names one character at a time with the following code.

df = pd.read_csv('name_list.csv')
text1s = list(df.full_name.values)
targets = list(df.last_name.values)
text1_tokenize = [tokenizer.encode(s) for s in text1s]
target_tokenize = [[tokenizer.encode(vv)[1:-1] for vv in v]  for v in targets]

The overall flow is: split full_name one character at a time, then have the model output, for each character, the probability that it belongs to the surname (label 1) rather than the given name (label 0). To give BERT the correct answers, a name such as 田中太郎 (Taro Tanaka) becomes ['田', '中', '太', '郎'] -> [1, 1, 0, 0]. attention_masks simply replaces the non-padding tokens with 1 (this may be unnecessary).
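
The code below calls a helper arr_indexes that is not defined in the article. A minimal sketch, assuming it should return every start index at which the token sub-list occurs inside x, could be:

def arr_indexes(x, token):
    # Return every index i at which the sub-list `token` starts inside the list `x`.
    token_len = len(token)
    return [i for i in range(len(x) - token_len + 1)
            if x[i:i + token_len] == token]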

def make_tags_arr(x, token):
    start_indexes = arr_indexes(x, token)
    max_len = len(x)
    token_len = len(token)
    arr = [0] * max_len
    for i in start_indexes:
        arr[i:i+token_len] = [1] * token_len
    return arr

tags_ids = []
for i in range(len(text1_tokenize)):
    text1 = text1_tokenize[i]
    targets = target_tokenize[i]
    
    tmp = [0] * len(text1)
    for t in targets:
        # Make a tag array like [0, 0, 1, 1, 0, 0, ...] marking the surname positions
        arr = make_tags_arr(text1, t)
        tmp = [min(x + y, 1) for (x, y) in zip(tmp, arr)]
    tags_ids.append(tmp)

attention_masks = [[float(i > 0) for i in ii] for ii in text1_tokenize]
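
As a quick sanity check (田中太郎 with surname 田中 is a made-up example), each token gets one label, with [CLS] and [SEP] labelled 0:

example = tokenizer.encode('田中太郎')                      # [CLS] 田 中 太 郎 [SEP]
surname_tokens = [tokenizer.encode(c)[1:-1] for c in '田中']
tags = [0] * len(example)
for t in surname_tokens:
    tags = [min(a + b, 1) for a, b in zip(tags, make_tags_arr(example, t))]
print(tags)  # -> [0, 1, 1, 0, 0, 0]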

All token sequences fed to BERT must have the same length, so pad them. Then split the dataset.

MAX_LEN = 32
input_ids = pad_sequences(text1_tokenize, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")
tags_ids = pad_sequences(tags_ids, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")
attention_masks = pad_sequences(attention_masks, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")

from sklearn.model_selection import train_test_split
RAN_SEED = 2020
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, tags_ids, random_state=RAN_SEED, test_size=0.1)
# same split ratio and seed as above so the masks line up with the inputs and labels
train_masks, validation_masks = train_test_split(attention_masks, random_state=RAN_SEED, test_size=0.1)

train_inputs = torch.LongTensor(train_inputs)
validation_inputs = torch.LongTensor(validation_inputs)
train_labels = torch.LongTensor(train_labels)
validation_labels = torch.LongTensor(validation_labels)
train_masks = torch.LongTensor(train_masks)
validation_masks = torch.LongTensor(validation_masks)

Load the dataset using the GPU if one is available, otherwise the CPU. Then load the pre-trained model.

if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 32

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

from transformers import AdamW, BertConfig
model_token_cls = BertForTokenClassification.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking', num_labels=2)
model_token_cls.to(device)

This prints an overview of the model. It is not needed for the processing itself, so you can skip it.

# Get all of the model's parameters as a list of tuples.
params = list(model_token_cls.named_parameters())
print('The BERT model has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')
for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== First Transformer ====\n')
for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== Output Layer ====\n')
for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

Define the accuracy metric. It is only used for reporting on the validation data. Here it is the F1 score: each character is judged as surname or given name, and the closer the predictions are to the labels, the closer the score gets to 1.

import datetime
def flat_accuracy(pred_masks, labels, input_masks):
    tp = ((pred_masks == 1) * (labels == 1)).sum().item()
    fp = ((pred_masks == 1) * (labels == 0)).sum().item()
    fn = ((pred_masks == 0) * (labels == 1)).sum().item()
    tn = ((pred_masks == 0) * (labels == 0)).sum().item()
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1 = 2*precision*recall/(precision+recall)

    return f1

def format_time(elapsed):
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
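
A tiny sanity check of flat_accuracy with toy tensors (the values are chosen only for illustration):

pred = torch.LongTensor([[1, 1, 0, 0], [1, 0, 0, 0]])
gold = torch.LongTensor([[1, 1, 0, 0], [1, 1, 0, 0]])
mask = torch.ones_like(gold)
print(flat_accuracy(pred, gold, mask))  # precision 1.0, recall 0.75 -> F1 is about 0.857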

This is the main part of the training.

from torch.optim import Adam
from transformers import get_linear_schedule_with_warmup

param_optimizer = list(model_token_cls.named_parameters())
no_decay = ["bias", "gamma", "beta"]
optimizer_grouped_parameters = [
  {'params' : [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
  'weight_decay' : 0.01},
  {'params' : [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
  'weight_decay' : 0.0}
]
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)

epochs = 3
max_grad_norm = 1.0
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)

#training
for epoch_i in range(epochs):
    # TRAIN loop
    model_token_cls.train()
    train_loss = 0
    nb_train_examples, nb_train_steps = 0, 0
    t0 = time.time()
    
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    
    for step, batch in enumerate(train_dataloader):
        
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
        
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # forward pass
        loss = model_token_cls(b_input_ids, token_type_ids = None, attention_mask = b_input_mask, labels = b_labels)
        # backward pass
        loss[0].backward()
        # track train loss
        train_loss += loss[0].item()
        nb_train_examples += b_input_ids.size(0)
        nb_train_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters = model_token_cls.parameters(), max_norm = max_grad_norm)
        # update parameters
        optimizer.step()
        scheduler.step()
        model_token_cls.zero_grad()
        
    # Calculate the average loss over the training data.
    avg_train_loss = train_loss / len(train_dataloader)
    
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(format_time(time.time() - t0)))
    
    # ========================================
    #               Validation
    # ========================================
    print("")
    print("Running Validation...")
    t0 = time.time()
    model_token_cls.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)

        b_input_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():        
            outputs = model_token_cls(b_input_ids, token_type_ids = None, attention_mask = b_input_mask, labels = b_labels)
        
        result = outputs[1].to('cpu')

        labels = b_labels.to('cpu')
        input_mask = b_input_mask.to('cpu')
        
        # Mask predicted label
        pred_masks = torch.min(torch.argmax(result, dim=2), input_mask)
        
        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(pred_masks, labels, input_mask)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy
        # Track the number of batches
        nb_eval_steps += 1
        
    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("Train loss: {}".format(train_loss / nb_train_steps))

Training in progress... The whole run took about 10 minutes.

Save the model you trained here.

pd.to_pickle(model_token_cls, 'First and last name separation model.pkl')
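
To reuse it later (assuming the same library versions and environment), the pickle can simply be loaded back:

model_token_cls = pd.read_pickle('First and last name separation model.pkl')
model_token_cls.eval()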

Using separately prepared verification data, we now collect the characters judged to be surnames into keywords. The tokenizer above covers about 4,000 characters, and anything outside that vocabulary becomes [UNK]. Since it is hard for the tokenizer to cover everything, including variant kanji, I added special handling for that case.

df = pd.read_csv('name_list_valid.csv')
keywords = []
MAX_LEN = 32
alls = list(df.full_name)
batch_size = 100

for i in range(math.ceil(len(alls)/batch_size)):
    print(i)
    s2 = list(df.full_name[i*batch_size:(i+1)*batch_size])
    d = torch.LongTensor(pad_sequences([tokenizer.encode(s) for s in s2], maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")).to(device)
    attention_mask = (d > 0) * 1
    with torch.no_grad():
        output = model_token_cls(d, token_type_ids = None, attention_mask = attention_mask)
    result = output[0].to('cpu')
    pred_masks = torch.min(torch.argmax(result, dim=2), attention_mask.to('cpu'))
    d = d.to('cpu')

    # Each row of pred_mask_squeeze is (batch_index, token_index) of a predicted surname character;
    # replace the token index with the corresponding token id.
    pred_mask_squeeze = pred_masks.nonzero()
    b = d[pred_mask_squeeze[:, 0], pred_mask_squeeze[:, 1]]
    pred_mask_squeeze[:, 1] = b
    for j in range(len(s2)):
        tmp = pred_mask_squeeze[pred_mask_squeeze[:,0] == j]
        s = tokenizer.convert_ids_to_tokens(tmp[:,1])
        # If the decoded result contains [UNK], fall back to taking the same number of characters from the start of the original string.
        if '[UNK]' in s:
            s = s2[j][0:len(s)]
        
        keywords.append(''.join(s))
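
Finally, a sketch of how the predictions might be attached to the DataFrame and scored (the column names follow the CSV format described earlier; the scoring itself is my own addition):

df['pred'] = keywords
print('accuracy: {:.1%}'.format((df['pred'] == df['last_name']).mean()))

# Whenever the predicted surname is actually a prefix of the full name,
# the remaining characters can be treated as the predicted given name.
df['first_name_pred'] = [full[len(p):] if full.startswith(p) else ''
                         for full, p in zip(df['full_name'], df['pred'])]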

The prediction can still go wrong when the surname and the given name contain the same kanji, but adding the constraint that the surname must be a contiguous span should make it even better. It turned out that, even without an explicit list of surnames and given names, feeding full names into BERT learns the split quite well. With properly prepared training data, the same implementation could also be applied to keyword-extraction logic for ordinary sentences.
