[PYTHON] Separation of Japanese surname and given name with BERT

Depending on the DB, the surname and given name are stored together, and there may be a desire to mechanically separate them. It's surprisingly difficult to do this even with a complete list of surnames. BERT is a hot topic right now, but I would like to introduce it because I was able to separate surnames and given names with high accuracy by learning personal names.

Result

The source code is quite long, so I will start with the results. The model separated the 1,200 verification records with an accuracy of **99.0%**. The verification data and part of the predictions (surname only) look like this.

last_name first_name full_name pred
89 Shape Mayu Mayu Katabe Shape
1114 Kumazoe Norio Norio Kumazoe Kumazoe
1068 Kimoto Souho Kimoto Soho Kimoto
55 Yashiki Takajin Hiroki Yashiki Takajin Yashiki Takajin
44 Basic Shodai Kishodai Basic

The 12 failed cases are as follows. The first one is the kind of name that even a human would likely split the same (wrong) way the model did.

last_name first_name full_name pred
11 Toshi Saburi Win Toshikatsu Sabane Sabane
341 Brush Kasumi Brush Kasumi Brush flower
345 Shinto Shinichi Shinichi Shinto Makoto Shinto
430 Chestnut Kanae Kanae Kuri Kurika
587 Keisuke Nina Kei Ryojina Kei
785 Bansho Good Bansho Turn
786 Yutaka Wakana Kana Toyowa Toyokazu
995 Seri Yu Seriyoshi Se
1061 So Real princess Somihime Somi
1062 Instep fruit Kogi Nomi Koki
1155 Hotaka Natsuho Hotaka Natsuho Hotakaho
1190 Extremely average dream Extreme dream very

For comparison, a simple surname/given-name split using janome (a morphological analyzer implemented purely in Python, with performance comparable to MeCab) and only its preset dictionary (ipadic), as shown below, achieved an accuracy of 34.5%.

from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()  # preset ipadic dictionary only

def extract_last_name(sentence):
    # Return the first token whose part of speech marks it as a surname (名詞,固有名詞,人名,姓).
    for token in tokenizer.tokenize(sentence):
        if '姓' in token.part_of_speech:
            return token.surface

df['pred'] = df['full_name'].apply(lambda full_name: extract_last_name(full_name))

Next, I obtained a surname dictionary from the surname database, fed it to janome as a user dictionary as shown below, and ran the analysis again. The accuracy improved to 79.7%.

tokenizer2 = Tokenizer('last_name_dic.csv', udic_enc="utf8")

def extract_last_name2(sentence):
    # If the first token is recognized as a surname (名詞,固有名詞,人名,姓), return it.
    token_arr = [token for token in tokenizer2.tokenize(sentence)]
    if '姓' in token_arr[0].part_of_speech:
        return token_arr[0].surface

df['pred2'] = df['full_name'].apply(lambda full_name: extract_last_name2(full_name))
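
The article does not show how last_name_dic.csv itself was built. As a rough sketch, assuming a plain text file with one surname per line (the file name, context IDs, and cost below are placeholder assumptions), a MeCab/IPADIC-format user dictionary could be generated like this:

import csv

# Hypothetical input: one surname per line (e.g. 佐藤, 鈴木, ...).
with open('last_name_list.txt', encoding='utf8') as f:
    last_names = [line.strip() for line in f if line.strip()]

# MeCab/IPADIC user-dictionary row:
# surface, left_id, right_id, cost, POS, POS detail 1-3, conj. type, conj. form, base form, reading, pronunciation
with open('last_name_dic.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    for name in last_names:
        # 1290/1290/5000 are rough placeholder context IDs and cost; tune them if segmentation looks off.
        writer.writerow([name, 1290, 1290, 5000,
                         '名詞', '固有名詞', '人名', '姓', '*', '*', name, '*', '*'])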

Adding a given-name dictionary as well would probably raise the accuracy further, but such a list seemed quite hard to obtain and process, so I will verify that if I ever find the time. (I probably won't.)

Source code

First, import what you need.

import pandas as pd
import numpy as np
from transformers import BertConfig, BertTokenizer, BertJapaneseTokenizer, BertForTokenClassification
from keras.preprocessing.sequence import pad_sequences
import torch
import MeCab
import math
import time

For the pretrained model, most examples seem to use bert-base-japanese-whole-word-masking, but unexpected word segmentation at tokenization time would be a problem here, so this time I use the char model, which splits the text one character at a time.

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')
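
As a quick check of what the char tokenizer does (the name 田中太郎 is just an example):

print(tokenizer.tokenize('田中太郎'))
# -> ['田', '中', '太', '郎']
print(tokenizer.encode('田中太郎'))
# -> the same characters as ids, wrapped in the [CLS] and [SEP] ids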

I used the Amazing Name Generator as the source of the training data. It is an interesting site in itself, since the rarity of each name is quantified. This time the data contained 48,000 names covering roughly 22,000 distinct surnames, so it is fairly widely distributed and includes rare surnames. The CSV has just three columns: full_name, last_name, and first_name. First, tokenize the names one character at a time with the following code.

df = pd.read_csv('name_list.csv')
text1s = list(df.full_name.values)
targets = list(df.last_name.values)
text1_tokenize = [tokenizer.encode(s) for s in text1s]
target_tokenize = [[tokenizer.encode(vv)[1:-1] for vv in v]  for v in targets]

The overall flow is: split full_name one character at a time, then have the model output, for each character, the probability that it belongs to the surname (label 1) rather than the given name (label 0). To give BERT the correct answers, a name such as 田中太郎 (Taro Tanaka) becomes ['田', '中', '太', '郎'] -> [1, 1, 0, 0]. attention_masks simply replaces the non-padding tokens with 1 (this may be unnecessary).
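
The code below calls a helper arr_indexes that is not defined in the article. A minimal sketch, assuming it should return every start index at which the token sub-list occurs inside x, could be:

def arr_indexes(x, token):
    # Return every index i at which the sub-list `token` starts inside the list `x`.
    token_len = len(token)
    return [i for i in range(len(x) - token_len + 1)
            if x[i:i + token_len] == token]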

def make_tags_arr(x, token):
    start_indexes = arr_indexes(x, token)
    max_len = len(x)
    token_len = len(token)
    arr = [0] * max_len
    for i in start_indexes:
        arr[i:i+token_len] = [1] * token_len
    return arr

tags_ids = []
for i in range(len(text1_tokenize)):
    text1 = text1_tokenize[i]
    targets = target_tokenize[i]
    
    tmp = [0] * len(text1)
    for t in targets:
        # Make a tag array like [0, 0, 1, 1, 0, 0, ...] marking the surname positions
        arr = make_tags_arr(text1, t)
        tmp = [min(x + y, 1) for (x, y) in zip(tmp, arr)]
    tags_ids.append(tmp)

attention_masks = [[float(i > 0) for i in ii] for ii in text1_tokenize]
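
As a quick sanity check (田中太郎 with surname 田中 is a made-up example), each token gets one label, with [CLS] and [SEP] labelled 0:

example = tokenizer.encode('田中太郎')                      # [CLS] 田 中 太 郎 [SEP]
surname_tokens = [tokenizer.encode(c)[1:-1] for c in '田中']
tags = [0] * len(example)
for t in surname_tokens:
    tags = [min(a + b, 1) for a, b in zip(tags, make_tags_arr(example, t))]
print(tags)  # -> [0, 1, 1, 0, 0, 0]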

All token sequences fed to BERT must have the same length, so pad them. Then split the dataset.

MAX_LEN = 32
input_ids = pad_sequences(text1_tokenize, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")
tags_ids = pad_sequences(tags_ids, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")
attention_masks = pad_sequences(attention_masks, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")

from sklearn.model_selection import train_test_split
RAN_SEED = 2020
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, tags_ids, random_state=RAN_SEED, test_size=0.1)
# same split ratio and seed as above so the masks line up with the inputs and labels
train_masks, validation_masks = train_test_split(attention_masks, random_state=RAN_SEED, test_size=0.1)

train_inputs = torch.LongTensor(train_inputs)
validation_inputs = torch.LongTensor(validation_inputs)
train_labels = torch.LongTensor(train_labels)
validation_labels = torch.LongTensor(validation_labels)
train_masks = torch.LongTensor(train_masks)
validation_masks = torch.LongTensor(validation_masks)

Load the dataset using the GPU if one is available, otherwise the CPU. Then load the pre-trained model.

if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 32

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

from transformers import AdamW, BertConfig
model_token_cls = BertForTokenClassification.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking', num_labels=2)
model_token_cls.to(device)

This prints an overview of the model. It is not needed for the processing itself, so you can skip it.

# Get all of the model's parameters as a list of tuples.
params = list(model_token_cls.named_parameters())
print('The BERT model has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')
for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== First Transformer ====\n')
for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== Output Layer ====\n')
for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

Define the accuracy metric. It is only used for reporting on the validation data. Here it is the F1 score: each character is judged as surname or given name, and the closer the predictions are to the labels, the closer the score gets to 1.

import datetime
def flat_accuracy(pred_masks, labels, input_masks):
    tp = ((pred_masks == 1) * (labels == 1)).sum().item()
    fp = ((pred_masks == 1) * (labels == 0)).sum().item()
    fn = ((pred_masks == 0) * (labels == 1)).sum().item()
    tn = ((pred_masks == 0) * (labels == 0)).sum().item()
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1 = 2*precision*recall/(precision+recall)

    return f1

def format_time(elapsed):
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
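
A tiny sanity check of flat_accuracy with toy tensors (the values are chosen only for illustration):

pred = torch.LongTensor([[1, 1, 0, 0], [1, 0, 0, 0]])
gold = torch.LongTensor([[1, 1, 0, 0], [1, 1, 0, 0]])
mask = torch.ones_like(gold)
print(flat_accuracy(pred, gold, mask))  # precision 1.0, recall 0.75 -> F1 is about 0.857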

This is the main part of the training.

from torch.optim import Adam
from transformers import get_linear_schedule_with_warmup

param_optimizer = list(model_token_cls.named_parameters())
no_decay = ["bias", "gamma", "beta"]
optimizer_grouped_parameters = [
  {'params' : [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
  'weight_decay' : 0.01},
  {'params' : [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
  'weight_decay' : 0.0}
]
optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)

epochs = 3
max_grad_norm = 1.0
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)

#training
for epoch_i in range(epochs):
    # TRAIN loop
    model_token_cls.train()
    train_loss = 0
    nb_train_examples, nb_train_steps = 0, 0
    t0 = time.time()
    
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    
    for step, batch in enumerate(train_dataloader):
        
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
        
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # forward pass
        loss = model_token_cls(b_input_ids, token_type_ids = None, attention_mask = b_input_mask, labels = b_labels)
        # backward pass
        loss[0].backward()
        # track train loss
        train_loss += loss[0].item()
        nb_train_examples += b_input_ids.size(0)
        nb_train_steps += 1
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters = model_token_cls.parameters(), max_norm = max_grad_norm)
        # update parameters
        optimizer.step()
        scheduler.step()
        model_token_cls.zero_grad()
        
    # Calculate the average loss over the training data.
    avg_train_loss = train_loss / len(train_dataloader)
    
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(format_time(time.time() - t0)))
    
    # ========================================
    #               Validation
    # ========================================
    print("")
    print("Running Validation...")
    t0 = time.time()
    model_token_cls.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)

        b_input_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():        
            outputs = model_token_cls(b_input_ids, token_type_ids = None, attention_mask = b_input_mask, labels = b_labels)
        
        result = outputs[1].to('cpu')

        labels = b_labels.to('cpu')
        input_mask = b_input_mask.to('cpu')
        
        # Mask predicted label
        pred_masks = torch.min(torch.argmax(result, dim=2), input_mask)
        
        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(pred_masks, labels, input_mask)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy
        # Track the number of batches
        nb_eval_steps += 1
        
    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("Train loss: {}".format(train_loss / nb_train_steps))

Training in progress... The whole run took about 10 minutes.

Save the model you trained here.

pd.to_pickle(model_token_cls, 'First and last name separation model.pkl')
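
To reuse it later (assuming the same library versions and environment), the pickle can simply be loaded back:

model_token_cls = pd.read_pickle('First and last name separation model.pkl')
model_token_cls.eval()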

Using separately prepared verification data, we now collect the characters judged to be surnames into keywords. The tokenizer above covers about 4,000 characters, and anything outside that vocabulary becomes [UNK]. Since it is hard for the tokenizer to cover everything, including variant kanji, I added special handling for that case.

df = pd.read_csv('name_list_valid.csv')
keywords = []
MAX_LEN = 32
alls = list(df.full_name)
batch_size = 100

for i in range(math.ceil(len(alls)/batch_size)):
    print(i)
    s2 = list(df.full_name[i*batch_size:(i+1)*batch_size])
    d = torch.LongTensor(pad_sequences([tokenizer.encode(s) for s in s2], maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")).to(device)
    attention_mask = (d > 0) * 1
    with torch.no_grad():
        output = model_token_cls(d, token_type_ids = None, attention_mask = attention_mask)
    result = output[0].to('cpu')
    pred_masks = torch.min(torch.argmax(result, dim=2), attention_mask.to('cpu'))
    d = d.to('cpu')

    # Each row of pred_mask_squeeze is (batch_index, token_index) of a predicted surname character;
    # replace the token index with the corresponding token id.
    pred_mask_squeeze = pred_masks.nonzero()
    b = d[pred_mask_squeeze[:, 0], pred_mask_squeeze[:, 1]]
    pred_mask_squeeze[:, 1] = b
    for j in range(len(s2)):
        tmp = pred_mask_squeeze[pred_mask_squeeze[:,0] == j]
        s = tokenizer.convert_ids_to_tokens(tmp[:,1])
        # If the decoded result contains [UNK], fall back to taking the same number of characters from the start of the original string.
        if '[UNK]' in s:
            s = s2[j][0:len(s)]
        
        keywords.append(''.join(s))
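
Finally, a sketch of how the predictions might be attached to the DataFrame and scored (the column names follow the CSV format described earlier; the scoring itself is my own addition):

df['pred'] = keywords
print('accuracy: {:.1%}'.format((df['pred'] == df['last_name']).mean()))

# Whenever the predicted surname is actually a prefix of the full name,
# the remaining characters can be treated as the predicted given name.
df['first_name_pred'] = [full[len(p):] if full.startswith(p) else ''
                         for full, p in zip(df['full_name'], df['pred'])]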

The prediction can still go wrong when the surname and the given name contain the same kanji, but adding the constraint that the surname must be a contiguous span should make it even better. It turned out that, even without an explicit list of surnames and given names, feeding full names into BERT learns the split quite well. With properly prepared training data, the same implementation could also be applied to keyword-extraction logic for ordinary sentences.
