In a previously posted article, Detecting text anomalies using Universal Sentence Encoder, I used the Universal Sentence Encoder (USE) to treat the task of finding sentences from securities reports mixed into the text of Natsume Soseki as an anomaly detection problem on directional data. This time I tackle the same kind of task using not only USE but also ELMo and BERT, and compare the three encoder models.
For both ELMo and BERT, I use the Japanese pretrained models published by Stockmark.
- Introduction of the ELMo model (MeCab-based) pretrained on a large-scale Japanese business news corpus
- Introduction of the BERT model (MeCab-based) pretrained on a large-scale Japanese business news corpus
All calculations were done on Google Colaboratory. Since BERT requires the TensorFlow 1.x series while USE requires the 2.x series, the work is split across multiple notebooks, as listed below, to separate the environments.
- `data_preparation.ipynb`: data preparation
- `ELMo_BERT_embedding.ipynb`: computing the embedding vectors with ELMo and BERT
- `USE_embedding.ipynb`: computing the embedding vectors with USE
- `anomaly_detection.ipynb`: anomaly detection
In the previous article, the main text was Natsume Soseki's novels, with sentences from securities reports mixed in as anomalies. This time, since the Stockmark models are pretrained on a business-domain corpus, it is more worthwhile to use the securities reports as the main text, and the anomalous data is taken from the livedoor news corpus instead. The livedoor news corpus is not ordinary news but a dataset of articles on topics such as home appliances, sports, and entertainment gossip.
First, import the required libraries and mount Google Drive.
data_preparation.ipynb
import re
import json
import glob
import numpy as np
from sklearn.model_selection import train_test_split
from google.colab import drive
drive.mount('/content/drive')
In the following, we will work in a directory called anomaly_detection under My Drive. Please replace the directory name as appropriate.
data_preparation.ipynb
%cd /content/drive/'My Drive'/anomaly_detection
First, download the chABSA dataset and extract only the text of the securities report. The procedure is the same as in the previous article.
data_preparation.ipynb
#Download and unpack the data
!wget https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/chABSA-dataset.zip
!unzip chABSA-dataset.zip
!rm chABSA-dataset.zip
#Create a list of paths to files
chabsa_path_list = glob.glob("chABSA-dataset/*.json")
#Store only the text part of the securities report in the list
chabsa_texts = []
for p in chabsa_path_list:
    with open(p, "br") as f:
        j = json.load(f)
    for line in j["sentences"]:
        chabsa_texts += [line["sentence"].replace('\n', '')]
print(len(chabsa_texts))
# 6119
Delete sentences that are too short and sentences that are too long.
data_preparation.ipynb
def filter_by_length(texts_input, min_len=20, max_len=300):
    texts_output = []
    for t in texts_input:
        if min_len <= len(t) <= max_len:
            texts_output.append(t)
    return texts_output
chabsa_texts = filter_by_length(chabsa_texts)
print(len(chabsa_texts))
# 5148
Next, download the livedoor news corpus. Of its several article categories, I use sports-watch.
data_preparation.ipynb
#Download and unpack the data
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!tar -xf ldcc-20140209.tar.gz
!rm ldcc-20140209.tar.gz
#Create a list of paths to files
livedoor_path_list = glob.glob('./text/sports-watch/sports-watch-*.txt')
len(livedoor_path_list)
# 900
Since the livedoor news texts contain many symbols such as ■, preprocessing removes the symbols listed below. Sentences containing a URL are excluded entirely, as are lines consisting only of alphanumeric characters and symbols, such as copyright lines.
data_preparation.ipynb
def cleaning_text(texts):
    # Exclude sentences containing URLs
    p = re.compile('https?://')
    if p.search(texts):
        return ''
    # Exclude lines consisting only of alphanumeric characters and symbols
    p = re.compile('[\w /:%#\$&\?\(\)~\.,=\+\-…()<>]+')
    if p.fullmatch(texts):
        return ''
    # Remove line breaks, full-width spaces, and frequently occurring symbols
    remove_list = ['\n', '\u3000', ' ', '<', '>', '・',
                   '■', '□', '●', '○', '◆', '◇',
                   '★', '☆', '▲', '△', '※', '*', '*', '——']
    for s in remove_list:
        texts = texts.replace(s, '')
    return texts
livedoor_texts = []
for path in livedoor_path_list:
    with open(path) as f:
        texts = f.readlines()
    livedoor_texts += [cleaning_text(t) for t in texts[4:]]  # the header lines (URL, date, title) are not needed
print(len(livedoor_texts))
To match the sentence format of the chABSA data, split the texts at sentence-ending punctuation and filter by sentence length.
data_preparation.ipynb
def split_texts(texts_input, split_by='。'):
    texts_output = []
    for t in texts_input:
        texts_output += t.split(split_by)
    return texts_output
livedoor_texts = split_texts(livedoor_texts)
print(len(livedoor_texts))
# 17522
livedoor_texts = filter_by_length(livedoor_texts)
print(len(livedoor_texts))
# 8149
Now that we have the two lists of sentences, we create development data and test data for the model. Sentences from the securities reports get label 0, and sentences from livedoor news get label 1. 80% of the securities-report sentences go into the development data and the remaining 20% into the test data. Anomalous livedoor sentences make up only about 1% of the development data, while the test data contains the two types of sentences in equal halves so that accuracy is easy to evaluate.
data_preparation.ipynb
def make_dataset(main_texts, anom_texts, main_dev_rate=0.8, anom_dev_rate=0.01):
    len1 = len(main_texts)
    len2 = len(anom_texts)
    num_dev1 = int(len1 * main_dev_rate)
    num_dev2 = int(num_dev1 * anom_dev_rate)
    num_test1 = len1 - num_dev1
    num_test2 = num_test1

    print("Development data main: {}, anom: {}".format(num_dev1, num_dev2))
    print("Test data main: {}, anom: {}".format(num_test1, num_test2))

    main_arr = np.hstack([np.reshape(np.zeros(len1), (-1, 1)),
                          np.reshape(main_texts, (-1, 1))])
    anom_arr = np.hstack([np.reshape(np.ones(len2), (-1, 1)),
                          np.reshape(anom_texts, (-1, 1))])

    dev1, test1 = train_test_split(main_arr, train_size=num_dev1)
    dev2, test2 = train_test_split(anom_arr, train_size=num_dev2)
    test2, _ = train_test_split(test2, train_size=num_test2)

    dev_arr = np.vstack([dev1, dev2])
    np.random.shuffle(dev_arr)
    test_arr = np.vstack([test1, test2])
    np.random.shuffle(test_arr)
    return dev_arr, test_arr
dev_arr, test_arr = make_dataset(chabsa_texts, livedoor_texts)
#Development data main: 4118, anom: 41
#Test data main: 1030, anom: 1030
print(dev_arr.shape, test_arr.shape)
# (4159, 2) (2060, 2)
Save the dataset you created.
data_preparation.ipynb
np.save('dev.npy', dev_arr)
np.save('test.npy', test_arr)
The following is a sample of some test data.
1: Rakuten's negligent play that would have turned into a major controversy had they lost to Seibu as things stood
0: The fund management balance was 17.9 billion yen (up 8.4% year-on-year) due to an increase in interest on loans, although interest and dividends on securities decreased.
1: I've known him since he was 18 years old, but now he's really grown up
0: With the establishment of the business succession navigator, we will carry out sales activities to meet the needs of mid-cap (mid-sized company) managers who want to select the best business succession plan from several options over a relatively long span.
0: Segment profit decreased by 14.6 billion yen (26.8%) from the previous consolidated fiscal year to 397.7 billion yen, due to an increase in quality-related expenses for airbag inflators, various expenses centered on selling, general and administrative expenses accompanying rising interest rates in the United States, the effects of exchange-rate fluctuations, and an increase in testing and research expenses.
1: Also, since we came, I think there was one time when I wanted the director to approve it.
1: And first place was "I wonder if that was the only kick that went perfectly according to the image."
0: Although the handling of imported air cargo remained firm, sales were 101.7 billion yen, down 13.3 billion yen (11.6%) from the previous consolidated fiscal year, and operating income decreased by 500 million yen (33.5%) from the previous consolidated fiscal year to 1.1 billion yen, partly due to foreign-exchange effects.
0: In the Japanese economy, while the employment and income environment continued to improve, the economy remained on a gradual recovery trend, although delays in improvement were seen in some areas.
0: In the PC content distribution business, we operate paid fan club sites for artists and celebrities and produce official sites under contract, developing the business with an eye toward future profits in other business divisions as well.
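For reference, a minimal sketch of how such a sample could be printed (the sample size of 10 and the formatting are my own choices):

import numpy as np

# Print a few randomly chosen (label, sentence) pairs from the test data
for i in np.random.choice(len(test_arr), size=10, replace=False):
    print("{}: {}".format(int(float(test_arr[i, 0])), test_arr[i, 1]))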
From here, we work in the notebook ELMo_BERT_embedding.ipynb. First, mount Google Drive as before.
ELMo_BERT_embedding.ipynb
from google.colab import drive
drive.mount('/content/drive')
The Stockmark pretrained models use MeCab with the NEologd dictionary as the tokenizer. Follow the official distribution page to install MeCab and the NEologd dictionary. First, run the following commands in the notebook to install MeCab itself.
ELMo_BERT_embedding.ipynb
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
Then install the NEologd dictionary. Working under the directory 'My Drive' raises an error because the directory name contains a space, so work in the root directory instead.
ELMo_BERT_embedding.ipynb
%cd /content
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
%cd mecab-ipadic-neologd
!echo yes | ./bin/install-mecab-ipadic-neologd -n
%cd /content/drive/'My Drive'/anomaly_detection
Finally, install the library for calling MeCab from Python.
ELMo_BERT_embedding.ipynb
!pip install mecab-python3
import MeCab
The Japanese ELMo used here requires input that has already been tokenized by MeCab, so load the dataset created above and tokenize it.
ELMo_BERT_embedding.ipynb
%cd /content/drive/'My Drive'/anomaly_detection
import numpy as np

dev_arr = np.load('dev.npy')
test_arr = np.load('test.npy')
# Extract only the sentences
dev_texts = dev_arr[:, 1]
test_texts = test_arr[:, 1]

def MeCab_tokenizer(texts):
    mecab = MeCab.Tagger(
        "-Owakati -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")
    token_list = []
    for s in texts:
        parsed = mecab.parse(s).replace('\n', '')
        token_list += [parsed.split()]
    return token_list
dev_tokens = MeCab_tokenizer(dev_texts)
test_tokens = MeCab_tokenizer(test_texts)
The BERT code used below takes its input from a text file, so also write the sentences out to files. These do not need to be tokenized in advance.
ELMo_BERT_embedding.ipynb
with open('dev_text.txt', mode='w') as f:
    f.write('\n'.join(dev_texts))
with open('test_text.txt', mode='w') as f:
    f.write('\n'.join(test_texts))
ELMo

The ELMo model uses this implementation (https://github.com/HIT-SCIR/ELMoForManyLangs) with Stockmark's pretrained parameters. For the specific usage, I referred to this article. First, install the required library and clone the repository.
ELMo_BERT_embedding.ipynb
!pip install overrides
!git clone https://github.com/HIT-SCIR/ELMoForManyLangs.git
Run setup.py to complete the installation.
ELMo_BERT_embedding.ipynb
%cd ./ELMoForManyLangs
!python setup.py install
%cd ..
Then download the Stockmark pretrained model from this link. There are two variants: a word-based embedding model and a character/word-based embedding model. Here we use the word-based embedding model. I placed the downloaded files in the folder ./ELMo_ja_word_level on the mounted Google Drive. The contents of the folder should look like this:
ELMo_BERT_embedding.ipynb
!ls ./ELMo_ja_word_level/
# char.dic config.json configs encoder.pkl token_embedder.pkl word.dic
The model is now ready. Create an `Embedder` instance and define a function that computes the embedding vector of each sentence.
ELMo_BERT_embedding.ipynb
from ELMoForManyLangs.elmoformanylangs import Embedder
from overrides import overrides
elmo_model_path = "./ELMo_ja_word_level"
elmo_embedder = Embedder(elmo_model_path, batch_size=64)
def get_elmo_embeddings(token_list, batch_size=64):
    length = len(token_list)
    n_loop = int(length / batch_size) + 1
    sent_emb = []
    for i in range(n_loop):
        token_emb = elmo_embedder.sents2elmo(
            token_list[batch_size*i: min(batch_size*(i+1), length)])
        for emb in token_emb:
            # Sum over tokens to obtain the embedding for a sentence
            sent_emb.append(sum(emb))
    return np.array(sent_emb)
The model outputs an embedding vector for each input token; the sum of all token vectors in a sentence is used as the sentence embedding vector. The ELMo vectors have 1024 dimensions.
ELMo_BERT_embedding.ipynb
dev_elmo_embeddings = get_elmo_embeddings(dev_tokens)
test_elmo_embeddings = get_elmo_embeddings(test_tokens)
print(dev_elmo_embeddings.shape, test_elmo_embeddings.shape)
# (4159, 1024) (2060, 1024)
np.save('dev_elmo_embeddings.npy', dev_elmo_embeddings)
np.save('test_elmo_embeddings.npy', test_elmo_embeddings)
BERT
For BERT as well, we use the pretrained model published by Stockmark. I downloaded the TensorFlow version from this download link and placed it in the folder ./BERT_base_stockmark. The contents of the folder are as follows.
ELMo_BERT_embedding.ipynb
!ls ./BERT_base_stockmark
# bert_config.json vocab.txt output_model.ckpt.index output_model.ckpt.meta output_model.ckpt.data-00000-of-00001
Import TensorFlow, specifying the 1.x series.
ELMo_BERT_embedding.ipynb
%tensorflow_version 1.x
import tensorflow as tf
Clone the model code from the official repository.
ELMo_BERT_embedding.ipynb
!git clone https://github.com/google-research/bert.git
The official repository provides the script `extract_features.py` for extracting embedding vectors, so with the original English model you could simply run it; to use the Japanese pretrained model, however, a few files need editing.

Modify `tokenization.py` to use MeCab as the tokenizer, following the instructions on the usage page. First, change the function `convert_tokens_to_ids(vocab, tokens)` as follows so that it returns 1, the id of [UNK], for unknown words.
bert/tokenization.py
def convert_tokens_to_ids(vocab, tokens):
    # Modified so that it returns id=1, which means [UNK], when a token is not in the vocab
    output = []
    for t in tokens:
        if t in vocab.keys():
            i = vocab[t]
        else:  # t is an unknown word
            i = 1
        output.append(i)
    return output
Next, rewrite the class `FullTokenizer(object)` as follows to use MecabTokenizer instead of WordpieceTokenizer. Other Japanese pretrained BERTs first split text into morphemes with MeCab or Juman++ and then apply the WordpieceTokenizer, but the Stockmark model uses only MeCab.
bert/tokenization.py
class FullTokenizer(object):
    """Runs end-to-end tokenization."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        #self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        #self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
        # Use MeCab instead
        self.mecab_tokenizer = MecabTokenizer()

    def tokenize(self, text):
        split_tokens = []
        # for token in self.basic_tokenizer.tokenize(text):
        # Use MeCab instead
        for token in self.mecab_tokenizer.tokenize(text):
            split_tokens.append(token)
        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        #return convert_by_vocab(self.vocab, tokens)
        return convert_tokens_to_ids(self.vocab, tokens)

    def convert_ids_to_tokens(self, ids):
        return convert_by_vocab(self.inv_vocab, ids)
Finally, add the class `MecabTokenizer`, which inherits from `BasicTokenizer`.
bert/tokenization.py
class MecabTokenizer(BasicTokenizer):

    def __init__(self):
        import MeCab
        path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
        self._mecab = MeCab.Tagger(path)

    def tokenize(self, text):
        """Tokenizes a piece of text with MeCab."""
        text = convert_to_unicode(text.replace(' ', ''))
        text = self._clean_text(text)

        mecab_result = self._mecab.parseToNode(text)
        split_tokens = []
        while mecab_result:
            if mecab_result.feature.split(",")[0] != 'BOS/EOS':
                split_tokens.append(mecab_result.surface)
            mecab_result = mecab_result.next

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens
The Japanese pretrained BERT is now ready to use. However, `extract_features.py` writes out the embedding vectors of every token of every input sentence, so the output file becomes large when there are many sentences. For this task we need only the sentence embedding vector, not the embedding of each token, so we rewrite `extract_features.py` to output only that. As the sentence embedding we use the average over all token vectors, as with ELMo (since the vectors are later normalized, the average and the sum give the same direction). Also, whereas the original code outputs the vectors of each specified layer separately, we change it to average over all the specified layers. The output is saved in numpy .npy format, so add the following import in the header section:
bert/extract_features.py
import numpy as np
In addition, change the last part of the main function, after the line `input_fn = input_fn_builder(`, as follows:
bert/extract_features.py
input_fn = input_fn_builder(
    features=features, seq_length=FLAGS.max_seq_length)

arr = []
for result in estimator.predict(input_fn, yield_single_examples=True):
    # unique_id_to_feature is defined earlier in main(); use it to look up the tokens
    unique_id = int(result["unique_id"])
    feature = unique_id_to_feature[unique_id]
    cnt = 0
    for (i, token) in enumerate(feature.tokens):
        for (j, layer_index) in enumerate(layer_indexes):
            layer_output = result["layer_output_%d" % j]
            if token != '[CLS]' and token != '[SEP]':
                if cnt == 0:
                    averaged_emb = np.array(layer_output[i:(i + 1)].flat)
                else:
                    averaged_emb += np.array(layer_output[i:(i + 1)].flat)
                cnt += 1
    averaged_emb /= cnt
    arr += [averaged_emb]
np.save(FLAGS.output_file, arr)
Everything is now ready, so run the following commands in the notebook ELMo_BERT_embedding.ipynb. As the layers from which to extract the embedding vectors, we use the five layers preceding the final layer (layers -2 to -6), excluding the final layer itself.
ELMo_BERT_embedding.ipynb
#BERT execution dev
!python ./bert/extract_features.py \
--input_file=dev_text.txt \
--output_file=dev_bert_embeddings.npy \
--vocab_file=./BERT_base_stockmark/vocab.txt \
--bert_config_file=./BERT_base_stockmark/bert_config.json \
--init_checkpoint=./BERT_base_stockmark/output_model.ckpt \
--layers -2,-3,-4,-5,-6
ELMo_BERT_embedding.ipynb
#BERT execution test
!python ./bert/extract_features.py \
--input_file=test_text.txt \
--output_file=test_bert_embeddings.npy \
--vocab_file=./BERT_base_stockmark/vocab.txt \
--bert_config_file=./BERT_base_stockmark/bert_config.json \
--init_checkpoint=./BERT_base_stockmark/output_model.ckpt \
--layers -2,-3,-4,-5,-6
USE

For USE, we use the lightweight version of the model trained on multiple languages including Japanese (the multilingual CNN version, v3). The usage is as explained in the previous article. TensorFlow version 2.x is used.
USE_embedding.ipynb
import numpy as np
import tensorflow_hub as hub
import tensorflow_text
import tensorflow as tf
from google.colab import drive
drive.mount('/content/drive')
USE_embedding.ipynb
%cd /content/drive/'My Drive'/anomaly_detection

# Load the datasets created in data_preparation.ipynb
dev_arr = np.load('dev.npy')
test_arr = np.load('test.npy')

use_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'
use_embedder = hub.load(use_url)

def get_use_embeddings(texts, batch_size=100):
    length = len(texts)
    n_loop = int(length / batch_size) + 1
    embeddings = use_embedder(texts[: batch_size])
    for i in range(1, n_loop):
        arr = use_embedder(texts[batch_size*i: min(batch_size*(i+1), length)])
        embeddings = tf.concat([embeddings, arr], axis=0)
    return np.array(embeddings)
USE_embedding.ipynb
dev_use_embeddings = get_use_embeddings(dev_arr[:, 1])
test_use_embeddings = get_use_embeddings(test_arr[:, 1])
print(dev_use_embeddings.shape, test_use_embeddings.shape)
# (4159, 512) (2060, 512)
np.save('dev_use_embeddings.npy', dev_use_embeddings)
np.save('test_use_embeddings.npy', test_use_embeddings)
Now that we have the embedding vectors from the three models, we build the anomaly detection model and compare their accuracy. In the notebook anomaly_detection.ipynb, load the development/test data and the embedding vectors.
anomaly_detection.ipynb
import numpy as np
from scipy.stats import chi2
from google.colab import drive
drive.mount('/content/drive')
anomaly_detection.ipynb
%cd /content/drive/'My Drive'/anomaly_detection
dev_arr = np.load('dev.npy')
test_arr = np.load('test.npy')
dev_elmo_embeddings = np.load('dev_elmo_embeddings.npy')
test_elmo_embeddings = np.load('test_elmo_embeddings.npy')
dev_bert_embeddings = np.load('dev_bert_embeddings.npy')
test_bert_embeddings = np.load('test_bert_embeddings.npy')
dev_use_embeddings = np.load('dev_use_embeddings.npy')
test_use_embeddings = np.load('test_use_embeddings.npy')
print(dev_elmo_embeddings.shape, test_elmo_embeddings.shape)
# (4159, 1024) (2060, 1024)
print(dev_bert_embeddings.shape, test_bert_embeddings.shape)
# (4159, 768) (2060, 768)
print(dev_use_embeddings.shape, test_use_embeddings.shape)
# (4159, 512) (2060, 512)
For details of the anomaly detection model, refer to the previous article. The model code is collected in the class `DirectionalAnomalyDetection` below.
anomaly_detection.ipynb
class DirectionalAnomalyDetection:

    def __init__(self, dev_embeddings, test_embeddings, dev_arr):
        self.dev_embeddings = self.normalize_arr(dev_embeddings)
        self.test_embeddings = self.normalize_arr(test_embeddings)
        self.dev_arr = dev_arr
        self.mu, self.anom = self.calc_anomality(self.dev_embeddings)
        self.mhat, self.shat = self.calc_chi2params(self.anom)
        print("mhat: {:.3f}".format(self.mhat))
        print("shat: {:.3e}".format(self.shat))
        self.anom_test = None

    def normalize_arr(self, arr):
        # Norm of each embedding vector
        norm = np.apply_along_axis(lambda x: np.linalg.norm(x), 1, arr)
        normed_arr = arr / np.reshape(norm, (-1, 1))
        return normed_arr

    def calc_anomality(self, embeddings):
        mu = np.mean(embeddings, axis=0)
        mu /= np.linalg.norm(mu)
        anom = 1 - np.inner(mu, embeddings)
        return mu, anom

    def calc_chi2params(self, anom):
        anom_mean = np.mean(anom)
        anom_mse = np.mean(anom**2) - anom_mean**2
        mhat = 2 * anom_mean**2 / anom_mse
        shat = 0.5 * anom_mse / anom_mean
        return mhat, shat

    def decide_ath_by_alpha(self, alpha, x_ini, max_ite=100, eps=1.e-12):
        # Newton's method
        x = x_ini
        for i in range(max_ite):
            xnew = x - (chi2.cdf(x, self.mhat, loc=0, scale=self.shat)
                        - (1 - alpha)) / chi2.pdf(x, self.mhat, loc=0, scale=self.shat)
            if abs(xnew - x) < eps:
                print("iteration: ", i+1)
                break
            x = xnew
        print("ath: {:.4f}".format(x))
        return x

    def decide_ath_by_labels(self, x_ini, max_ite=100, eps=1.e-12):
        anom0 = self.anom[self.dev_arr[:, 0] == '0.0']
        anom1 = self.anom[self.dev_arr[:, 0] == '1.0']
        mhat0, shat0 = self.calc_chi2params(anom0)
        mhat1, shat1 = self.calc_chi2params(anom1)
        ath = self._find_crossing_point(mhat0, shat0, mhat1, shat1, x_ini, max_ite, eps)
        print("ath: {:.4f}".format(ath))
        return ath

    def _find_crossing_point(self, mhat0, shat0, mhat1, shat1, x_ini, max_ite, eps):
        # Newton's method
        x = x_ini
        for i in range(max_ite):
            xnew = x - self._crossing_func(x, mhat0, shat0, mhat1, shat1)
            if abs(xnew - x) < eps:
                print("iteration: ", i+1)
                break
            x = xnew
        return x

    def _crossing_func(self, x, mhat0, shat0, mhat1, shat1):
        chi2_0 = chi2.pdf(x, mhat0, loc=0, scale=shat0)
        chi2_1 = chi2.pdf(x, mhat1, loc=0, scale=shat1)
        nume = x * (chi2_0 - chi2_1)
        deno = (mhat0 * 0.5 - 1 - x / shat0 * 0.5) * chi2_0 \
               - (mhat1 * 0.5 - 1 - x / shat1 * 0.5) * chi2_1
        return nume / deno

    def predict(self, ath):
        self.anom_test = 1 - np.inner(self.mu, self.test_embeddings)
        predict_arr = (self.anom_test > ath).astype(int)
        return predict_arr
Create a model instance for each of the three types of embeddings. The parameters $\hat{m}$ and $\hat{s}$ obtained by fitting the anomaly-score distribution with a chi-square distribution are printed.
anomaly_detection.ipynb
#-- ELMo
dad_elmo = DirectionalAnomalyDetection(dev_elmo_embeddings, test_elmo_embeddings, dev_arr)
# mhat: 6.813
# shat: 3.011e-03
#-- BERT
dad_bert = DirectionalAnomalyDetection(dev_bert_embeddings, test_bert_embeddings, dev_arr)
# mhat: 20.358
# shat: 6.912e-03
#-- USE
dad_use = DirectionalAnomalyDetection(dev_use_embeddings, test_use_embeddings, dev_arr)
# mhat: 36.410
# shat: 1.616e-02
For every model, the effective dimension $\hat{m}$ is much smaller than the actual dimension of the embedding vectors. It is interesting that the larger the actual embedding dimension (ELMo: 1024, BERT: 768, USE: 512), the smaller the effective dimension. The chi-square distributions with the estimated parameters are plotted below together with the actual anomaly-score distributions. Reflecting the small effective dimension, the peak of the anomaly-score distribution lies at smaller values for BERT and especially for ELMo. It can also be seen that the chi-square fit is relatively poor for ELMo and BERT.
<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/fadbad99-0e34-1c35-55c2-699617b0f370.png" height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/e4c3dee9-2af5-d68a-966c-59ebebf7c52a.png" height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/909631e1-a7f4-8eb0-14f1-f3d2414b7b1f.png" height=200>
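For reference, a minimal sketch of how such a plot can be drawn, using the dad_use instance defined above (bin count and styling are my own choices):

import matplotlib.pyplot as plt
from scipy.stats import chi2

# Histogram of the anomaly scores on the development data (USE as an example)
plt.hist(dad_use.anom, bins=100, density=True, alpha=0.5, label='anomaly score')
# Chi-square pdf with the estimated parameters mhat and shat
x = np.linspace(1e-6, dad_use.anom.max(), 500)
plt.plot(x, chi2.pdf(x, dad_use.mhat, loc=0, scale=dad_use.shat), label='chi-square fit')
plt.xlabel('anomaly score')
plt.legend()
plt.show()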
Next, we determine the threshold $a_\text{th}$ of the anomaly score. In the previous article, the threshold $a_\text{th}$ was determined by solving

$$1 - \alpha = \int_0^{a_\text{th}} \! dx \, \chi^2(x \mid \hat{m}, \hat{s})$$

for a predetermined false-alarm rate $\alpha$. This method is implemented in the method `decide_ath_by_alpha` of the class defined above. Only the results are shown here: with this method the classification accuracy on the test data is significantly low for all three models. The false-alarm rate is set to 0.01, the same value as last time.
|      | Accuracy | Precision | Recall | F1    |
|------|----------|-----------|--------|-------|
| ELMo | 0.568    | 0.973     | 0.141  | 0.246 |
| BERT | 0.580    | 0.946     | 0.170  | 0.288 |
| USE  | 0.577    | 0.994     | 0.155  | 0.269 |
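The calls themselves are not shown in the article; a sketch of how `decide_ath_by_alpha` would be invoked (the initial value x_ini=0.5 is my assumption) looks like this:

# alpha = 0.01 as stated above; x_ini is an assumed initial value for Newton's method
ath_use_alpha = dad_use.decide_ath_by_alpha(alpha=0.01, x_ini=0.5)
predict_use_alpha = dad_use.predict(ath_use_alpha)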
To see why the accuracy is so low, we plot histograms of the anomaly scores of the test data separately for normal and anomalous data. The peak on the left is the normal data, and the peak on the right is the anomalous data. The red vertical line marks the threshold $a_\text{th}$ obtained by `decide_ath_by_alpha`. Only the result for USE is shown here, but the other models look similar. The normal and anomalous data are separated to some extent, yet most of the data ends up classified as normal because the threshold is too large.
<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/800c812d-897d-8472-bac2-0f6bf5431e97.png" height=240>
In the previous article, the chi-square distribution fitted the anomaly-score distribution well, and the distributions of normal and anomalous data were separated fairly clearly, so this method worked. For the present data it does not, because the overlap between the normal and anomalous distributions is relatively large.
We would now like to determine the threshold $a_\text{th}$ by some other method, but within the unsupervised setting there seems to be no way forward. We therefore set the threshold in a supervised setting, assuming that the normal/anomalous labels of the development data are known. Specifically, we take the following steps:

1. Fit a chi-square distribution separately to the anomaly-score distributions of the normal and the anomalous development data.
2. Take as the threshold the point where the two fitted distributions cross.

This procedure is implemented in the method `decide_ath_by_labels` of the class `DirectionalAnomalyDetection`, so execute it as follows.

anomaly_detection.ipynb
#-- ELMo
ath_elmo = dad_elmo.decide_ath_by_labels(x_ini=0.02)
# iteration: 5
# ath: 0.0312
#-- BERT
ath_bert = dad_bert.decide_ath_by_labels(x_ini=0.2)
# iteration: 6
# ath: 0.1740
#-- USE
ath_use = dad_use.decide_ath_by_labels(x_ini=0.8)
# iteration: 5
# ath: 0.7924
Make predictions on the test data using the obtained thresholds and evaluate the accuracy.
anomaly_detection.ipynb
# Predictions
predict_elmo = dad_elmo.predict(ath_elmo)
predict_bert = dad_bert.predict(ath_bert)
predict_use = dad_use.predict(ath_use)
# Ground-truth labels
answer = test_arr[:, 0].astype(float)
anomaly_detection.ipynb
# Function for accuracy evaluation
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay)

def calc_precision(answer, predict, title, export_fig=False):
    acc = accuracy_score(answer, predict)
    precision = precision_score(answer, predict)
    recall = recall_score(answer, predict)
    f1 = f1_score(answer, predict)
    cm = confusion_matrix(answer, predict)
    print("Accuracy: {:.3f}, Precision: {:.3f}, Recall: {:.3f}, \
F1: {:.3f}".format(acc, precision, recall, f1))

    plt.rcParams["font.size"] = 18
    cmd = ConfusionMatrixDisplay(cm, display_labels=[0, 1])
    cmd.plot(cmap=plt.cm.Blues, values_format='d')
    plt.title(title)
    if export_fig:
        plt.savefig("./cm_" + title + ".png", bbox_inches='tight')
    plt.show()
    return [acc, precision, recall, f1]
anomaly_detection.ipynb
# Accuracy evaluation
_ = calc_precision(answer, predict_elmo, title='ELMo', export_fig=True)
_ = calc_precision(answer, predict_bert, title='BERT', export_fig=True)
_ = calc_precision(answer, predict_use, title='USE', export_fig=True)
The results are as follows.
|      | Accuracy | Precision | Recall | F1    |
|------|----------|-----------|--------|-------|
| ELMo | 0.916    | 0.910     | 0.923  | 0.917 |
| BERT | 0.851    | 0.847     | 0.858  | 0.852 |
| USE  | 0.946    | 0.931     | 0.963  | 0.947 |
Confusion matrices: <img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/ec194af5-3dc1-836f-0739-18e3dceb6529.png" height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/a539e4d5-161d-2c87-5fc0-5da16b9780ee.png" height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/70faf102-db47-c39f-4908-d8dc529cad3f.png" height=200>
Using thresholds determined in the supervised setting greatly improves the accuracy of every model. Comparing the three models, USE has the highest accuracy on every metric, while BERT has the lowest.
Finally, we plot the anomaly scores of the test data for each model, separately for normal and anomalous data. <img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/98cee787-b773-e0fe-b10c-09262fc9440c.png" height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/c57dbbdc-4b5e-06cf-d542-b6b35c5344e5.png" height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/fe357b32-518e-58b8-07eb-fd1e2c74bb70.png" height=200> The dotted red vertical line shows the threshold obtained using the correct labels. Although this threshold was determined from development data containing only 1% anomalous samples, it can be seen that it indeed sits at the intersection of the normal and anomalous distributions. For BERT, the two distributions overlap substantially, which is why its accuracy is the lowest.
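A minimal sketch of how such a histogram could be drawn, using USE as an example (it assumes dad_use.predict(ath_use) has already been called so that anom_test is populated; bin count and styling are my own choices):

import matplotlib.pyplot as plt

# Split the test anomaly scores by the true label and overlay the threshold
labels = test_arr[:, 0].astype(float)
plt.hist(dad_use.anom_test[labels == 0.0], bins=50, alpha=0.5, label='normal')
plt.hist(dad_use.anom_test[labels == 1.0], bins=50, alpha=0.5, label='anomalous')
plt.axvline(x=ath_use, color='red', linestyle='--', label='threshold')
plt.xlabel('anomaly score')
plt.legend()
plt.show()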
Since the anomaly-score threshold could not be set properly in the unsupervised setting of the previous article, we built the anomaly detection model with supervised learning. Even so, the correct labels were used only to determine the threshold, so the accuracy was determined by the quality of the encoder model, that is, by how well it separates the anomaly-score distributions of normal and anomalous data. By that measure, the most accurate encoder model was USE. Perhaps tasks based on cosine similarity are, after all, exactly what USE is good at.