Depending on the database, the surname and given name may be stored together in a single field, and you may want to split them apart mechanically. This is surprisingly difficult to do even with a complete list of surnames. BERT is a hot topic right now, and since I was able to separate surnames and given names with high accuracy by training it on personal names, I would like to introduce the approach.
The source code is quite long, so I will start with the results. The model separated the 1,200 validation records with an accuracy of **99.0%**. A sample of the validation data and the predictions (surname only) looks like this:
index | last_name | first_name | full_name | pred |
---|---|---|---|---|
89 | Shape | Mayu | Mayu Katabe | Shape |
1114 | Kumazoe | Norio | Norio Kumazoe | Kumazoe |
1068 | Kimoto | Souho | Kimoto Soho | Kimoto |
55 | Yashiki Takajin | Hiroki | Yashiki Takajin | Yashiki Takajin |
44 | Basic | Shodai | Kishodai | Basic |
The 12 failed cases are as follows. A case like the first row would probably be split at the same wrong position even by a human.
index | last_name | first_name | full_name | pred |
---|---|---|---|---|
11 | Toshi Saburi | Win | Toshikatsu Sabane | Sabane |
341 | Brush | Kasumi | Brush Kasumi | Brush flower |
345 | Shinto | Shinichi | Shinichi Shinto | Makoto Shinto |
430 | Chestnut | Kanae | Kanae Kuri | Kurika |
587 | Keisuke | Nina | Kei Ryojina | Kei |
785 | Bansho | Good | Bansho | Turn |
786 | Yutaka | Wakana | Kana Toyowa | Toyokazu |
995 | Seri | Yu | Seriyoshi | Se |
1061 | So | Real princess | Somihime | Somi |
1062 | Instep | fruit | Kogi Nomi | Koki |
1155 | Hotaka | Natsuho | Hotaka Natsuho | Hotakaho |
1190 | Extremely average | dream | Extreme dream | very |
By the way, using janome (a morphological analyzer written entirely in Python, with performance comparable to MeCab) with only its preset dictionary (ipadic) and the simple surname extraction below, the accuracy was only 34.5%.
```python
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()

def extract_last_name(sentence):
    for token in tokenizer.tokenize(sentence):
        # '姓' ("surname") is the ipadic part-of-speech detail used for surnames
        if '姓' in token.part_of_speech:
            return token.surface

df['pred'] = df['full_name'].apply(lambda full_name: extract_last_name(full_name))
```
Obtaining a surname dictionary from a surname database and feeding it to janome as a user dictionary, as shown below, improved the accuracy to 79.7%.
```python
tokenizer2 = Tokenizer('last_name_dic.csv', udic_enc="utf8")

def extract_last_name2(sentence):
    token_arr = [token for token in tokenizer2.tokenize(sentence)]
    # Only accept the first token if it is tagged as a surname ('姓')
    if '姓' in token_arr[0].part_of_speech:
        return token_arr[0].surface

df['pred2'] = df['full_name'].apply(lambda full_name: extract_last_name2(full_name))
```
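For reference, janome's user dictionary accepts MeCab IPADIC-format CSV rows, so the surname list has to be converted into that shape first. Below is a minimal sketch of such a conversion; `surnames.txt`, the context IDs, and the cost are illustrative placeholders rather than values from the article.

```python
import csv

# Sketch: turn a plain one-surname-per-line file into a janome user dictionary.
# The context IDs (1290) and cost (5000) are placeholder values; the part-of-speech
# fields follow the ipadic convention 名詞,固有名詞,人名,姓 for surnames.
with open('surnames.txt', encoding='utf8') as f:
    surnames = [line.strip() for line in f if line.strip()]

with open('last_name_dic.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    for name in surnames:
        writer.writerow([name, 1290, 1290, 5000,
                         '名詞', '固有名詞', '人名', '姓', '*', '*', name, '*', '*'])
```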
Adding a given-name list would probably improve the accuracy further, but it seemed quite difficult to obtain and preprocess, so I will verify that if I ever find the time. (I probably won't.)
First, import what you need.
```python
import pandas as pd
import numpy as np
from transformers import BertConfig, BertTokenizer, BertJapaneseTokenizer, BertForTokenClassification
from keras.preprocessing.sequence import pad_sequences
import torch
import MeCab
import math
import time  # used below to time the training loop
```
For the pretrained model, most examples seem to use bert-base-japanese-whole-word-masking, but odd word segmentation at tokenization time would be a problem here, so this time I use the char variant, which splits the text one character at a time.
```python
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')
```
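As a quick check of the char model's behavior, tokenizing a name should yield one token per character (the example name and the expected output here are mine, for illustration):

```python
# The char tokenizer splits text into single characters rather than words.
print(tokenizer.tokenize('田中太郎'))
# expected: ['田', '中', '太', '郎']
```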
I used the Amazing Name Generator as the source of the training data. It is an interesting site just to browse, since it quantifies how unusual each name is. This time the data contains 48,000 names with roughly 22,000 distinct surnames, so it is fairly widely distributed, including rare surnames. The CSV has only three columns: full_name, last_name, first_name. First, tokenize everything one character at a time with the following code.
```python
df = pd.read_csv('name_list.csv')
text1s = list(df.full_name.values)
targets = list(df.last_name.values)

text1_tokenize = [tokenizer.encode(s) for s in text1s]
# Encode each surname character on its own; [1:-1] drops the [CLS]/[SEP] ids that encode() adds.
target_tokenize = [[tokenizer.encode(vv)[1:-1] for vv in v] for v in targets]
```
The overall flow: full_name is split into single characters, and each character is labeled 1 if it belongs to the surname and 0 if it belongs to the given name; the model then predicts these labels. To give BERT the correct answers, for example 田中太郎 (Tanaka Taro) becomes ['田', '中', '太', '郎'] -> [1, 1, 0, 0]. attention_masks simply marks every position that has a token with 1 (this may be unnecessary).
```python
def arr_indexes(x, token):
    # (Assumed helper, not shown in the original snippet.)
    # Return every start index where the sub-sequence `token` occurs in `x`.
    token_len = len(token)
    return [i for i in range(len(x) - token_len + 1) if x[i:i+token_len] == token]

def make_tags_arr(x, token):
    start_indexes = arr_indexes(x, token)
    max_len = len(x)
    token_len = len(token)
    arr = [0] * max_len
    for i in start_indexes:
        arr[i:i+token_len] = [1] * token_len
    return arr

tags_ids = []
for i in range(len(text1_tokenize)):
    text1 = text1_tokenize[i]
    targets = target_tokenize[i]
    tmp = [0] * len(text1)
    for t in targets:
        # Build a tag array like [0,0,1,1,0,0,...] for this surname character
        arr = make_tags_arr(text1, t)
        tmp = [min(x + y, 1) for (x, y) in zip(tmp, arr)]
    tags_ids.append(tmp)

attention_masks = [[float(i > 0) for i in ii] for ii in text1_tokenize]
```
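As a sanity check on one hypothetical example (surname 田中, given name 太郎), the [CLS]/[SEP] ids that encode() adds simply stay labeled 0:

```python
# Build the tag array for a single illustrative name and inspect it.
example_ids = tokenizer.encode('田中太郎')
example_tags = [0] * len(example_ids)
for t in [tokenizer.encode(c)[1:-1] for c in '田中']:
    example_tags = [min(x + y, 1) for x, y in zip(example_tags, make_tags_arr(example_ids, t))]
print(tokenizer.convert_ids_to_tokens(example_ids))  # expected: ['[CLS]', '田', '中', '太', '郎', '[SEP]']
print(example_tags)                                  # expected: [0, 1, 1, 0, 0, 0]
```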
Every token array fed to BERT must have the same length, so the sequences are padded. Then the dataset is split.
```python
MAX_LEN = 32
input_ids = pad_sequences(text1_tokenize, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")
tags_ids = pad_sequences(tags_ids, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")
attention_masks = pad_sequences(attention_masks, maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")

from sklearn.model_selection import train_test_split

RAN_SEED = 2020
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, tags_ids, random_state=RAN_SEED, test_size=0.1)
# Split the masks with the same seed and ratio so they stay aligned with the inputs.
train_masks, validation_masks = train_test_split(attention_masks, random_state=RAN_SEED, test_size=0.1)

train_inputs = torch.LongTensor(train_inputs)
validation_inputs = torch.LongTensor(validation_inputs)
train_labels = torch.LongTensor(train_labels)
validation_labels = torch.LongTensor(validation_labels)
train_masks = torch.LongTensor(train_masks)
validation_masks = torch.LongTensor(validation_masks)
```
Choose GPU or CPU and build the DataLoaders for the dataset. Then load the pretrained model.
```python
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

from transformers import AdamW, BertConfig

model_token_cls = BertForTokenClassification.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking', num_labels=2)
# Move the model to the selected device (works whether or not a GPU is available).
model_token_cls.to(device)
```
The following prints an overview of the model. It is not needed for the actual processing, so you can skip it.
```python
# Get all of the model's parameters as a list of tuples.
params = list(model_token_cls.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')
for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')
for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')
for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
```
Define the accuracy metric; it is only used for reporting on the validation data. Here it is the F1 score over the per-character surname/given-name decisions: the better the predictions, the closer it is to 1.
```python
import datetime

def flat_accuracy(pred_masks, labels, input_masks):
    tp = ((pred_masks == 1) * (labels == 1)).sum().item()
    fp = ((pred_masks == 1) * (labels == 0)).sum().item()
    fn = ((pred_masks == 0) * (labels == 1)).sum().item()
    tn = ((pred_masks == 0) * (labels == 0)).sum().item()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

def format_time(elapsed):
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
```
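A quick hand-checked example of the F1 computation, using toy tensors rather than real data:

```python
# pred gets one surname character wrong in the second row: tp=3, fp=0, fn=1.
pred = torch.tensor([[1, 1, 0, 0], [1, 0, 0, 0]])
gold = torch.tensor([[1, 1, 0, 0], [1, 1, 0, 0]])
mask = torch.ones_like(gold)
print(flat_accuracy(pred, gold, mask))  # precision 1.0, recall 0.75 -> F1 ≈ 0.857
```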
This is the main part of the training.
```python
from torch.optim import Adam
from transformers import get_linear_schedule_with_warmup

param_optimizer = list(model_token_cls.named_parameters())
no_decay = ["bias", "gamma", "beta"]
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

optimizer = Adam(optimizer_grouped_parameters, lr=3e-5)

epochs = 3
max_grad_norm = 1.0
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
```
```python
# training
for epoch_i in range(epochs):
    # TRAIN loop
    model_token_cls.train()
    train_loss = 0
    nb_train_examples, nb_train_steps = 0, 0
    t0 = time.time()

    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    for step, batch in enumerate(train_dataloader):
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # forward pass
        loss = model_token_cls(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        # backward pass
        loss[0].backward()

        # track train loss
        train_loss += loss[0].item()
        nb_train_examples += b_input_ids.size(0)
        nb_train_steps += 1

        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model_token_cls.parameters(), max_norm=max_grad_norm)

        # update parameters
        optimizer.step()
        scheduler.step()
        model_token_cls.zero_grad()

    # Calculate the average loss over the training data.
    avg_train_loss = train_loss / len(train_dataloader)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))

    # ========================================
    #               Validation
    # ========================================
    print("")
    print("Running Validation...")
    t0 = time.time()
    model_token_cls.eval()

    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            outputs = model_token_cls(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        result = outputs[1].to('cpu')
        labels = b_labels.to('cpu')
        input_mask = b_input_mask.to('cpu')

        # Mask the predicted labels with the attention mask
        pred_masks = torch.min(torch.argmax(result, dim=2), input_mask)

        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(pred_masks, labels, input_mask)

        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy / nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))
    print("Train loss: {}".format(train_loss / nb_train_steps))
```
Training took about 10 minutes. Save the trained model here.
```python
pd.to_pickle(model_token_cls, 'First and last name separation model.pkl')
```
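The pickled model can later be restored the same way, for example:

```python
# pandas pickles arbitrary Python objects, so read_pickle gives the model object back.
model_token_cls = pd.read_pickle('First and last name separation model.pkl')
```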
Using separately prepared validation data, we now extract the characters judged to be surnames into keywords. The tokenizer above only knows about 4,000 characters, and characters outside that vocabulary become [UNK]. Since it seems unreasonable to expect the tokenizer to cover everything, including variant kanji, I added special handling for that case.
```python
df = pd.read_csv('name_list_valid.csv')
keywords = []
MAX_LEN = 32
alls = list(df.full_name)
batch_size = 100

for i in range(math.ceil(len(alls) / batch_size)):
    print(i)
    s2 = list(df.full_name[i*batch_size:(i+1)*batch_size])
    d = torch.LongTensor(pad_sequences([tokenizer.encode(s) for s in s2],
                                       maxlen=MAX_LEN, dtype="long", truncating="pre", padding="pre")).cuda()
    attention_mask = (d > 0) * 1
    output = model_token_cls(d, token_type_ids=None, attention_mask=attention_mask)
    result = output[0].to('cpu')
    pred_masks = torch.min(torch.argmax(result, dim=2), attention_mask.to('cpu'))
    d = d.to('cpu')
    # (row, column) indices of every token predicted to be part of a surname
    pred_mask_squeeze = pred_masks.nonzero().squeeze()
    b = d[pred_mask_squeeze.T.numpy()]
    pred_mask_squeeze[:, 1] = b
    for j in range(len(s2)):
        tmp = pred_mask_squeeze[pred_mask_squeeze[:, 0] == j]
        s = tokenizer.convert_ids_to_tokens(tmp[:, 1])
        # If the restored tokens contain [UNK], fall back to taking the same number
        # of characters from the beginning of the original string.
        if '[UNK]' in s:
            s = s2[j][0:len(s)]
        keywords.append(''.join(s))
```
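For reference, the exact-match accuracy quoted at the top and the failure table can be obtained from `keywords` with a simple comparison; this sketch assumes `name_list_valid.csv` has the same `last_name` / `first_name` columns as the training CSV:

```python
# Compare the predicted surnames with the true surnames.
df['pred'] = keywords
print('accuracy: {:.1%}'.format((df['pred'] == df['last_name']).mean()))

# The rows the model got wrong (the failure table shown earlier).
print(df[df['pred'] != df['last_name']][['last_name', 'first_name', 'full_name', 'pred']])
```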
The judgment can still go wrong when the surname and the given name contain the same kanji, but adding the condition that the surname characters must be contiguous would probably improve it further. I found that, even without preparing lists of surnames and given names, BERT can learn the separation quite well just from full names. I am not entirely sure what the model is doing internally, but with properly prepared training data, the same implementation could be applied to keyword extraction from sentences.
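For example, the "surname must be contiguous" constraint mentioned above could be added as a small post-processing step on the predicted labels. The function below is my own illustration of the idea, not code from the article:

```python
def first_contiguous_block(labels):
    # Keep only the first contiguous run of 1s; zero out any later, disconnected 1s.
    out, seen_block, in_block = [], False, False
    for v in labels:
        if v == 1 and not seen_block:
            in_block = True
            out.append(1)
        else:
            if in_block:
                seen_block, in_block = True, False
            out.append(0)
    return out

print(first_contiguous_block([0, 1, 1, 0, 1]))  # -> [0, 1, 1, 0, 0]
```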