[PYTHON] Using ELMo, BERT, and USE to detect anomalies in sentences

Overview

In a previously posted article, Detecting text anomalies using Universal Sentence Encoder, I used the Universal Sentence Encoder (USE) to treat the task of finding securities-report sentences mixed into the text of Natsume Soseki as an anomaly detection problem on directional data. This time, I solve the same kind of task not only with USE but also with ELMo and BERT, and compare the three encoder models.

For both ELMo and BERT, I use the Japanese pre-trained models published by Stockmark.

- Introduction of the ELMo (MeCab-based) model pre-trained on a large-scale Japanese business news corpus
- Introduction of the BERT (MeCab-based) model pre-trained on a large-scale Japanese business news corpus

Environment

All calculations were done on Google Colaboratory. Since BERT requires the TensorFlow 1.x series while USE requires the 2.x series, the work is split across multiple notebooks to keep the environments separate:

- `data_preparation.ipynb` — data preparation
- `ELMo_BERT_embedding.ipynb` — computing embedding vectors with ELMo and BERT
- `USE_embedding.ipynb` — computing embedding vectors with USE
- `anomaly_detection.ipynb` — anomaly detection

Data preparation

In the previous article, the main text consisted of Natsume Soseki's novels, with securities-report sentences mixed in as anomalous data. This time, since the Stockmark models are valuable for being pre-trained on a business-domain corpus, the securities reports serve as the main text, and the anomalous data is taken from the livedoor news corpus. The livedoor news corpus is not ordinary news, but a dataset of articles on home appliances, sports, entertainment gossip, and the like.

First, import the required libraries and mount Google Drive.

data_preparation.ipynb


import re
import json
import glob
import numpy as np
from sklearn.model_selection import train_test_split

from google.colab import drive
drive.mount('/content/drive')

In the following, we work in a directory called anomaly_detection under My Drive. Replace the directory name as appropriate.

data_preparation.ipynb


%cd /content/drive/'My Drive'/anomaly_detection

First, download the chABSA dataset and extract only the text of the securities report. The procedure is the same as in the previous article.

data_preparation.ipynb


#Download and unpack the data
!wget https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/chABSA-dataset.zip
!unzip chABSA-dataset.zip
!rm chABSA-dataset.zip

#Create a list of paths to files
chabsa_path_list = glob.glob("chABSA-dataset/*.json")

#Store only the text part of the securities report in the list
chabsa_texts = []
for p in chabsa_path_list:
    with open(p, "br") as f:
        j =  json.load(f)
    for line in j["sentences"]:
        chabsa_texts += [line["sentence"].replace('\n', '')]

print(len(chabsa_texts))
# 6119

Delete sentences that are too short and sentences that are too long.

data_preparation.ipynb


def filter_by_length(texts_input, min_len=20, max_len=300):
    texts_output = []
    for t in texts_input:
        length = len(t)
        if length >= min_len and length <= max_len:
            texts_output.append(t)
    return texts_output

chabsa_texts = filter_by_length(chabsa_texts)
print(len(chabsa_texts))
# 5148

Then download the livedoor news corpus. Of the several article categories, I use sports-watch.

data_preparation.ipynb


#Download and unpack the data
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!tar -xf ldcc-20140209.tar.gz
!rm ldcc-20140209.tar.gz

#Create a list of paths to files
livedoor_path_list = glob.glob('./text/sports-watch/sports-watch-*.txt')
len(livedoor_path_list)
# 900

Since the livedoor news corpus text contains many symbols such as ■, we preprocess it by removing the symbols listed below. Sentences containing a URL are excluded entirely, as are lines consisting only of alphanumeric characters and punctuation, such as copyright notices.

data_preparation.ipynb


def cleaning_text(texts):
    #Excludes sentences containing URLs
    p = re.compile('https?://')
    if p.search(texts):
        return ''
    #Excludes lines with only alphanumeric characters
    p = re.compile('[\w /:%#\$&\?\(\)~\.,=\+\-…()<>]+')
    if p.fullmatch(texts):
        return ''
    #Excludes line breaks, double-byte spaces, and frequent symbols
    remove_list = ['\n', '\u3000', ' ', '<', '>', '・', 
                   '■', '□', '●', '○', '◆', '◇', 
                   '★', '☆', '▲', '△', '※', '*', '*', '——']
    for s in remove_list:
        texts = texts.replace(s, '')
    return texts

livedoor_texts = []
for path in livedoor_path_list:
    with open(path) as f:
        texts = f.readlines()
    livedoor_texts += [cleaning_text(t) for t in texts[4:]]  # skip the header lines at the top of each file
print(len(livedoor_texts))

Split the text at sentence-final punctuation to match the format of the chABSA data, then filter by sentence length.

data_preparation.ipynb


def split_texts(texts_input, split_by='。'):
    texts_output = []
    for t in texts_input:
        texts_output += t.split(split_by)
    return texts_output

livedoor_texts = split_texts(livedoor_texts)
print(len(livedoor_texts))
# 17522

livedoor_texts = filter_by_length(livedoor_texts)
print(len(livedoor_texts))
# 8149

Now that we have the two lists of sentences, we create development and test data for the model. Securities-report sentences are labeled 0 and livedoor news sentences are labeled 1. 80% of the securities-report sentences go into the development data and the remaining 20% into the test data. Only about 1% of the development data consists of livedoor news sentences, while the test data contains the two types of sentences in equal proportion (50% each) so that accuracy is easy to evaluate.

data_preparation.ipynb


def make_dataset(main_texts, anom_texts, main_dev_rate=0.8, anom_dev_rate=0.01):
    len1 = len(main_texts)
    len2 = len(anom_texts)
    num_dev1 = int(len1 * main_dev_rate)
    num_dev2 = int(num_dev1 * anom_dev_rate)
    num_test1 = len1 - num_dev1
    num_test2 = num_test1

    print("Development data main: {}, anom: {}".format(num_dev1, num_dev2))
    print("Test data main: {}, anom: {}".format(num_test1, num_test2))

    main_arr = np.hstack([np.reshape(np.zeros(len1), (-1, 1)), 
                          np.reshape(main_texts, (-1, 1))])
    anom_arr = np.hstack([np.reshape(np.ones(len2), (-1, 1)), 
                          np.reshape(anom_texts, (-1, 1))])
    
    dev1, test1 = train_test_split(main_arr, train_size=num_dev1)
    dev2, test2 = train_test_split(anom_arr, train_size=num_dev2)
    test2, _ = train_test_split(test2, train_size=num_test2)

    dev_arr = np.vstack([dev1, dev2])
    np.random.shuffle(dev_arr)
    test_arr = np.vstack([test1, test2])
    np.random.shuffle(test_arr)
    return dev_arr, test_arr

dev_arr, test_arr = make_dataset(chabsa_texts, livedoor_texts)
#Development data main: 4118, anom: 41
#Test data main: 1030, anom: 1030

print(dev_arr.shape, test_arr.shape)
# (4159, 2) (2060, 2)

Save the dataset you created.

data_preparation.ipynb


np.save('dev.npy', dev_arr)
np.save('test.npy', test_arr)

The following is a sample of some test data.

1: Rakuten's negligent play that would have caused a major uproar if they had lost to Seibu as things stood
0: The fund management balance was 17.9 billion yen (up 8.4% year-on-year) because interest on loans increased, although interest and dividends on securities decreased.
1: I've known him since he was 18 years old, but now he's really grown up, for a kid
0: With the establishment of the business succession navigator, we will carry out sales activities to meet the needs of mid-cap (mid-sized company) managers who want to select the best business succession plan from several options over a relatively long span.
0: Segment profit was 3,977 million yen, down 26.8% from the previous consolidated fiscal year, due to an increase in quality-related expenses for airbag inflators, various expenses centered on selling, general and administrative expenses accompanying rising interest rates in the United States, the effects of exchange rate fluctuations, and an increase in test and research expenses.
1: Also, since we came, I think there was one time when I wanted the director to approve it.
1: And in 1st place was "I wonder if that was the only one that was kicked perfectly according to the image."
0: Although the handling of imported air cargo remained firm, sales were 101.7 billion yen, down 13.3 billion yen (11.6%) from the previous consolidated fiscal year, and operating income decreased by 500 million yen (33.5%) from the previous consolidated fiscal year to 1.1 billion yen, partly due to the effects of foreign exchange.
0: In the Japanese economy, while the employment and income environment continued to improve, the economy remained on a gradual recovery trend, although some delays in improvement were seen.
0: In the PC content distribution business, we operate paid fan-club sites for artists and talents and carry out contract production of official sites, developing the business with an eye to future profits, including those of other business divisions.

MeCab + NEologd dictionary installation

From here you will work on the notebook ELMo_BERT_embedding.ipynb. First, mount Google Drive in the same way as before.

ELMo_BERT_embedding.ipynb


from google.colab import drive
drive.mount('/content/drive')

The Stockmark pre-trained models use MeCab with the NEologd dictionary as the tokenizer. Follow the official distribution page to install the MeCab + NEologd dictionary. First, run the following commands in the notebook to install MeCab itself.

ELMo_BERT_embedding.ipynb


!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y

Then install the NEologd dictionary. If you work under the 'My Drive' directory, the space in the directory name causes an error, so work in the root directory instead.

ELMo_BERT_embedding.ipynb


%cd /content
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
%cd mecab-ipadic-neologd
!echo yes | ./bin/install-mecab-ipadic-neologd -n
%cd /content/drive/'My Drive'/anomaly_detection

Finally, install the library for calling MeCab from Python.

ELMo_BERT_embedding.ipynb


!pip install mecab-python3
import MeCab

The Japanese ELMo model used here requires input that has already been tokenized with MeCab, so we load the dataset created earlier and tokenize it.

ELMo_BERT_embedding.ipynb


import numpy as np

%cd /content/drive/'My Drive'/anomaly_detection
dev_arr = np.load('dev.npy')
test_arr = np.load('test.npy')

#Cut out only sentences
dev_texts = dev_arr[:, 1]
test_texts = test_arr[:, 1]

def MeCab_tokenizer(texts):
    mecab = MeCab.Tagger(
        "-Owakati -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")
    token_list = []
    for s in texts:
        parsed = mecab.parse(s).replace('\n', '')
        token_list += [parsed.split()]
    return token_list

dev_tokens = MeCab_tokenizer(dev_texts)
test_tokens = MeCab_tokenizer(test_texts)

The BERT script used below takes its input from a text file, so we also write the sentences to files. These do not need to be tokenized in advance.

ELMo_BERT_embedding.ipynb


with open('dev_text.txt', mode='w') as f:
    f.write('\n'.join(dev_texts))
with open('test_text.txt', mode='w') as f:
    f.write('\n'.join(test_texts))

ELMo

For ELMo, we use this implementation (https://github.com/HIT-SCIR/ELMoForManyLangs) with Stockmark's pre-trained parameters. For specific usage, I referred to:

- How to use the ELMo (MeCab-based) model pre-trained on a large-scale Japanese business news corpus, with accuracy comparison

First, install the required libraries and clone the repository.

ELMo_BERT_embedding.ipynb


!pip install overrides
!git clone https://github.com/HIT-SCIR/ELMoForManyLangs.git

Run setup.py to complete the installation.

ELMo_BERT_embedding.ipynb


%cd ./ELMoForManyLangs
!python setup.py install
%cd ..

Next, download the Stockmark pre-trained model from this link. There are two versions: a word-based embedding model and a character/word-based embedding model. This time, we use the word-based embedding model. I placed the downloaded files in the folder ./ELMo_ja_word_level on the mounted Google Drive. The contents of the folder should look like this:

ELMo_BERT_embedding.ipynb


!ls ./ELMo_ja_word_level/
# char.dic  config.json  configs  encoder.pkl  token_embedder.pkl  word.dic

The model is now ready. Create an `Embedder` instance and define a function that computes the embedding vector of each sentence.

ELMo_BERT_embedding.ipynb


from ELMoForManyLangs.elmoformanylangs import Embedder
from overrides import overrides

elmo_model_path = "./ELMo_ja_word_level"
elmo_embedder = Embedder(elmo_model_path, batch_size=64)

def get_elmo_embeddings(token_list, batch_size=64):
    length = len(token_list)
    n_loop = -(-length // batch_size)  # ceiling division, avoids an empty final batch
    sent_emb = []
    for i in range(n_loop):
        token_emb = elmo_embedder.sents2elmo(
            token_list[batch_size*i: min(batch_size*(i+1), length)])
        for emb in token_emb:
            # sum over tokens to obtain the embedding for a sentence
            sent_emb.append(sum(emb))
    return np.array(sent_emb)

The model outputs an embedding vector for each input token. The sum of all token vectors in a sentence is used as the sentence embedding vector. The dimension of the ELMo vectors is 1024.

ELMo_BERT_embedding.ipynb


dev_elmo_embeddings = get_elmo_embeddings(dev_tokens)
test_elmo_embeddings = get_elmo_embeddings(test_tokens)
print(dev_elmo_embeddings.shape, test_elmo_embeddings.shape)
# (4159, 1024) (2060, 1024)

np.save('dev_elmo_embeddings.npy', dev_elmo_embeddings)
np.save('test_elmo_embeddings.npy', test_elmo_embeddings)

BERT

For BERT, we use the pre-trained model published by Stockmark. I downloaded the TensorFlow version from the download link and placed it in the folder ./BERT_base_stockmark. The contents of the folder are as follows:

ELMo_BERT_embedding.ipynb


!ls ./BERT_base_stockmark
# bert_config.json  vocab.txt  output_model.ckpt.index  output_model.ckpt.meta output_model.ckpt.data-00000-of-00001 

Import TensorFlow, specifying the 1.x series.

ELMo_BERT_embedding.ipynb


%tensorflow_version 1.x
import tensorflow as tf

Clone the model code from the official repository.

ELMo_BERT_embedding.ipynb


!git clone https://github.com/google-research/bert.git

The official repository includes a script, `extract_features.py`, for extracting embedding vectors, so with the original English model you could simply run it. To use the Japanese pre-trained model, however, a few files need to be edited.

Change tokenization.py to use MeCab as the tokenizer, following the instructions on the usage page. First, change the function convert_tokens_to_ids(vocab, tokens) as follows so that it returns 1, the id of [UNK], for unknown words.

bert/tokenization.py


def convert_tokens_to_ids(vocab, tokens):
  # modify so that it returns id=1 which means [UNK] when a token is not in vocab
  output = []
  for t in tokens:
    if t in vocab.keys():
      i = vocab[t]
    else:   # if t is [UNK]
      i = 1
    output.append(i)
  return output

Next, rewrite the class FullTokenizer(object) as follows to use MecabTokenizer instead of WordpieceTokenizer. Other Japanese pre-trained BERTs first split text into morphemes with MeCab or Juman++ and then apply WordpieceTokenizer, but the Stockmark model uses only MeCab.

bert/tokenization.py


class FullTokenizer(object):
  """Runs end-to-end tokenziation."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}
    #self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    #self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
    # use Mecab
    self.mecab_tokenizer = MecabTokenizer()

  def tokenize(self, text):
    split_tokens = []
    # for token in self.basic_tokenizer.tokenize(text):
    # use MeCab instead of the basic tokenizer
    for token in self.mecab_tokenizer.tokenize(text):
      split_tokens.append(token)

    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    #return convert_by_vocab(self.vocab, tokens)
    return convert_tokens_to_ids(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return convert_by_vocab(self.inv_vocab, ids)

Finally, add the class MecabTokenizer, which inherits from BasicTokenizer.

bert/tokenization.py


class MecabTokenizer(BasicTokenizer):
  def __init__(self):
    import MeCab
    path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
    self._mecab = MeCab.Tagger(path)

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text.replace(' ', ''))
    text = self._clean_text(text)

    mecab_result = self._mecab.parseToNode(text)
    split_tokens = []
    while mecab_result:
      # skip the BOS/EOS nodes that MeCab adds
      if mecab_result.feature.split(",")[0] != 'BOS/EOS':
        split_tokens.append(mecab_result.surface)
      mecab_result = mecab_result.next

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

The Japanese pre-trained BERT is now ready to use, but `extract_features.py` outputs a file storing the embedding vectors of every token of every input sentence, so with many sentences the output file becomes very large. In this task we use only a sentence embedding vector, not the embedding vector of each token, so we rewrite `extract_features.py` to output only that. As with ELMo, a single vector per sentence is obtained by aggregating the token vectors; here the average over all tokens is used. Also, regarding the layers from which the embedding is extracted, the original code outputs the vector of each specified layer separately, but we change it so that the vectors of all specified layers are averaged. The output is saved in numpy npy format, so in the header part

bert/extract_features.py


import numpy as np

is added. In addition, change the last part of the main function, after the line `input_fn = input_fn_builder(`, as follows:

bert/extract_features.py


  input_fn = input_fn_builder(
      features=features, seq_length=FLAGS.max_seq_length)

  arr = []
  for result in estimator.predict(input_fn, yield_single_examples=True):
    # look up the feature for this example (as in the original script)
    unique_id = int(result["unique_id"])
    feature = unique_id_to_feature[unique_id]
    cnt = 0
    for (i, token) in enumerate(feature.tokens):
      for (j, layer_index) in enumerate(layer_indexes):
        layer_output = result["layer_output_%d" % j]
        # accumulate the token vector, skipping the special tokens
        if token != '[CLS]' and token != '[SEP]':
          if cnt == 0:
            averaged_emb = np.array(layer_output[i:(i + 1)].flat)
          else:
            averaged_emb += np.array(layer_output[i:(i + 1)].flat)
          cnt += 1
    # average over tokens and layers
    averaged_emb /= cnt
    arr += [averaged_emb]
  np.save(FLAGS.output_file, arr)

Now that everything is ready, run the following commands in the notebook ELMo_BERT_embedding.ipynb. As the layers from which to extract the embedding vector, we use the five layers below the final layer (layers -2 through -6).

ELMo_BERT_embedding.ipynb


#BERT execution dev
!python ./bert/extract_features.py \
  --input_file=dev_text.txt \
  --output_file=dev_bert_embeddings.npy \
  --vocab_file=./BERT_base_stockmark/vocab.txt \
  --bert_config_file=./BERT_base_stockmark/bert_config.json \
  --init_checkpoint=./BERT_base_stockmark/output_model.ckpt \
  --layers -2,-3,-4,-5,-6

ELMo_BERT_embedding.ipynb


#BERT execution test
!python ./bert/extract_features.py \
  --input_file=test_text.txt \
  --output_file=test_bert_embeddings.npy \
  --vocab_file=./BERT_base_stockmark/vocab.txt \
  --bert_config_file=./BERT_base_stockmark/bert_config.json \
  --init_checkpoint=./BERT_base_stockmark/output_model.ckpt \
  --layers -2,-3,-4,-5,-6

USE

For USE, we use the lightweight version of the model trained on multiple languages including Japanese (multilingual, CNN version, v3). The usage is as explained in the previous article. TensorFlow version 2.x is used.

USE_embedding.ipynb


import numpy as np
import tensorflow_hub as hub
import tensorflow_text
import tensorflow as tf

from google.colab import drive
drive.mount('/content/drive')

USE_embedding.ipynb


%cd /content/drive/'My Drive'/anomaly_detection
dev_arr = np.load('dev.npy')
test_arr = np.load('test.npy')

use_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'
embed = hub.load(use_url)

def get_use_embeddings(texts, batch_size=100):
    length = len(texts)
    n_loop = -(-length // batch_size)  # ceiling division
    embeddings = embed(texts[: batch_size])
    for i in range(1, n_loop):
        arr = embed(texts[batch_size*i: min(batch_size*(i+1), length)])
        embeddings = tf.concat([embeddings, arr], axis=0)
    return np.array(embeddings)

USE_embedding.ipynb


dev_use_embeddings = get_use_embeddings(dev_arr[:, 1])
test_use_embeddings = get_use_embeddings(test_arr[:, 1])
print(dev_use_embeddings.shape, test_use_embeddings.shape)
# (4159, 512) (2060, 512)

np.save('dev_use_embeddings.npy', dev_use_embeddings)
np.save('test_use_embeddings.npy', test_use_embeddings)

Anomaly detection model development

Now that we have the embedding vectors from the three models, we build an anomaly detection model and compare the accuracy. Load the development/test data and the embedding vectors in the notebook anomaly_detection.ipynb.

anomaly_detection.ipynb


import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay)

from google.colab import drive
drive.mount('/content/drive')

anomaly_detection.ipynb


%cd /content/drive/'My Drive'/anomaly_detection
dev_arr = np.load('dev.npy')
test_arr = np.load('test.npy')
dev_elmo_embeddings = np.load('dev_elmo_embeddings.npy')
test_elmo_embeddings = np.load('test_elmo_embeddings.npy')
dev_bert_embeddings = np.load('dev_bert_embeddings.npy')
test_bert_embeddings = np.load('test_bert_embeddings.npy')
dev_use_embeddings = np.load('dev_use_embeddings.npy')
test_use_embeddings = np.load('test_use_embeddings.npy')

print(dev_elmo_embeddings.shape, test_elmo_embeddings.shape)
# (4159, 1024) (2060, 1024)
print(dev_bert_embeddings.shape, test_bert_embeddings.shape)
# (4159, 768) (2060, 768)
print(dev_use_embeddings.shape, test_use_embeddings.shape)
# (4159, 512) (2060, 512)

For details of the anomaly detection model, see the previous article. The model code is collected in the class DirectionalAnomalyDetection below.

anomaly_detection.ipynb


class DirectionalAnomalyDetection:
  def __init__(self, dev_embeddings, test_embeddings, dev_arr):
    self.dev_embeddings = self.normalize_arr(dev_embeddings)
    self.test_embeddings = self.normalize_arr(test_embeddings)
    self.dev_arr = dev_arr
    self.mu, self.anom  = self.calc_anomality(self.dev_embeddings)
    self.mhat, self.shat = self.calc_chi2params(self.anom)
    print("mhat: {:.3f}".format(self.mhat))
    print("shat: {:.3e}".format(self.shat))
    self.anom_test = None

  def normalize_arr(self, arr):  
    norm = np.apply_along_axis(lambda x: np.linalg.norm(x), 1, arr) # norm of each vector
    normed_arr = arr / np.reshape(norm, (-1,1))
    return normed_arr

  def calc_anomality(self, embeddings):
    mu = np.mean(embeddings, axis=0)
    mu /= np.linalg.norm(mu)
    anom = 1 - np.inner(mu, embeddings)
    return mu, anom

  def calc_chi2params(self, anom):
    anom_mean = np.mean(anom)
    anom_mse = np.mean(anom**2) - anom_mean**2
    mhat = 2 * anom_mean**2 / anom_mse
    shat = 0.5 * anom_mse / anom_mean
    return mhat, shat

  def decide_ath_by_alpha(self, alpha, x_ini, max_ite=100, eps=1.e-12):
    # Newton's method
    x = x_ini
    for i in range(max_ite):
      xnew = x - (chi2.cdf(x, self.mhat, loc=0, scale=self.shat) 
            - (1 - alpha)) / chi2.pdf(x, self.mhat, loc=0, scale=self.shat)
      if abs(xnew - x) < eps:
        print("iteration: ", i+1)
        break
      x = xnew
    print("ath: {:.4f}".format(x))
    return x

  def decide_ath_by_labels(self, x_ini, max_ite=100, eps=1.e-12):
    anom0 = self.anom[self.dev_arr[:, 0] == '0.0']
    anom1 = self.anom[self.dev_arr[:, 0] == '1.0']
    mhat0, shat0 = self.calc_chi2params(anom0)
    mhat1, shat1 = self.calc_chi2params(anom1)
    ath = self._find_crossing_point(mhat0, shat0, mhat1, shat1, x_ini, max_ite, eps)
    print("ath: {:.4f}".format(ath))
    return ath

  def _find_crossing_point(self, mhat0, shat0, mhat1, shat1, x_ini, max_ite, eps):
    # Newton's method
    x = x_ini
    for i in range(max_ite):
      xnew = x - self._crossing_func(x, mhat0, shat0, mhat1, shat1)
      if abs(xnew - x) < eps:
        print("iteration: ", i+1)
        break
      x = xnew
    return x

  def _crossing_func(self, x, mhat0, shat0, mhat1, shat1):
    chi2_0 = chi2.pdf(x, mhat0, loc=0, scale=shat0)
    chi2_1 = chi2.pdf(x, mhat1, loc=0, scale=shat1)
    nume = x * (chi2_0 - chi2_1)
    deno = (mhat0 * 0.5 - 1 - x / shat0 * 0.5) * chi2_0 \
          -(mhat1 * 0.5 - 1 - x / shat1 * 0.5) * chi2_1
    return nume / deno

  def predict(self, ath):
    self.anom_test = 1 - np.inner(self.mu, self.test_embeddings)
    predict_arr = (self.anom_test > ath).astype(int)
    return predict_arr

Create a model instance for each of the three types of embeddings. The parameters $\hat{m}$ and $\hat{s}$ obtained by fitting the anomaly-score distribution with a chi-square distribution are printed.
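For reference, the moment-matching estimates implemented in calc_chi2params are

$$\hat{m} = \frac{2\,\langle a \rangle^2}{\langle a^2 \rangle - \langle a \rangle^2}, \qquad \hat{s} = \frac{\langle a^2 \rangle - \langle a \rangle^2}{2\,\langle a \rangle},$$

where $a$ is the anomaly score and $\langle \cdot \rangle$ denotes the mean over the development data; these follow from the mean $\hat{m}\hat{s}$ and variance $2\hat{m}\hat{s}^2$ of the scaled chi-square distribution.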

anomaly_detection.ipynb


#-- ELMo
dad_elmo = DirectionalAnomalyDetection(dev_elmo_embeddings, test_elmo_embeddings, dev_arr)
# mhat: 6.813
# shat: 3.011e-03

#-- BERT
dad_bert = DirectionalAnomalyDetection(dev_bert_embeddings, test_bert_embeddings, dev_arr)
# mhat: 20.358
# shat: 6.912e-03

#-- USE
dad_use = DirectionalAnomalyDetection(dev_use_embeddings, test_use_embeddings, dev_arr)
# mhat: 36.410
# shat: 1.616e-02

For every model, the effective dimension $\hat{m}$ is much smaller than the actual dimension of the embedding vectors. Interestingly, the larger the actual embedding dimension (ELMo: 1024, BERT: 768, USE: 512), the smaller the effective dimension. The chi-square distributions with the estimated parameters are plotted below together with the actual anomaly-score distributions. Reflecting the small effective dimension, the peak of the anomaly-score distribution sits at a small value for BERT and especially for ELMo. It can also be seen that the chi-square fit is comparatively poor for ELMo and BERT.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/fadbad99-0e34-1c35-55c2-699617b0f370.png", height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/e4c3dee9-2af5-d68a-966c-59ebebf7c52a.png", height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/909631e1-a7f4-8eb0-14f1-f3d2414b7b1f.png", height=200>
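The plotting code for these figures is not part of the article; the following is a minimal matplotlib sketch, assuming only the quantities stored on the DirectionalAnomalyDetection instances (bin count and styling are my own choices).

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

def plot_anomaly_fit(dad, title):
    # histogram of the development-data anomaly scores
    plt.hist(dad.anom, bins=100, density=True, alpha=0.5, label='anomaly score')
    # chi-square pdf with the fitted parameters
    x = np.linspace(1e-6, dad.anom.max(), 500)
    plt.plot(x, chi2.pdf(x, dad.mhat, loc=0, scale=dad.shat), 'r-', label='chi2 fit')
    plt.title(title)
    plt.xlabel('anomaly score')
    plt.legend()
    plt.show()

plot_anomaly_fit(dad_use, 'USE')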

Next, we determine the threshold $a_\text{th}$ for the anomaly score. In the previous article, we determined the threshold by solving

$$1-\alpha = \int_0^{a_\text{th}} \! dx \, \chi^2(x \,|\, \hat{m}, \hat{s})$$

for a predetermined false-alarm rate $\alpha$. This method is implemented in the method decide_ath_by_alpha of the class defined above. Only the results are shown here, but with this method the classification accuracy on the test data is significantly low for all three models. The false-alarm rate is set to 0.01, the same as last time.

| Model | Accuracy | Precision | Recall | F1 |
|-------|----------|-----------|--------|-------|
| ELMo  | 0.568    | 0.973     | 0.141  | 0.246 |
| BERT  | 0.580    | 0.946     | 0.170  | 0.288 |
| USE   | 0.577    | 0.994     | 0.155  | 0.269 |
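The article reports only these numbers; a sketch of how they would be produced with decide_ath_by_alpha follows. The x_ini starting values for Newton's method are my own guesses, chosen near each model's anomaly-score scale.

alpha = 0.01  # false-alarm rate, same value as in the previous article

ath_elmo_alpha = dad_elmo.decide_ath_by_alpha(alpha, x_ini=0.02)
ath_bert_alpha = dad_bert.decide_ath_by_alpha(alpha, x_ini=0.2)
ath_use_alpha = dad_use.decide_ath_by_alpha(alpha, x_ini=0.8)

# predictions on the test data with one of these thresholds
predict_use_alpha = dad_use.predict(ath_use_alpha)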

To see why the accuracy is so low, we plot histograms of the anomaly scores of the test data separately for normal and anomalous data. The peak on the left is the normal data, and the peak on the right is the anomalous data. The red vertical line is the threshold $a_\text{th}$ obtained by decide_ath_by_alpha. Only the result for USE is shown here, but the results for the other models are similar. The normal and anomalous data are separated to some extent, but because the threshold is too large, most of the data is classified as normal.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/800c812d-897d-8472-bac2-0f6bf5431e97.png", height=240>

In the results of the previous article, the chi-square distribution fit the anomaly-score distribution well, and the distributions of the normal and anomalous data were quite clearly separated, so this method worked. For the present data it does not, because the overlap between the normal and anomalous distributions is relatively large.
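A minimal sketch of a plot like the one above, assuming test_arr is already loaded and ath_use_alpha comes from the sketch in the previous section (bins and styling are my own choices):

import numpy as np
import matplotlib.pyplot as plt

def plot_test_anomaly(dad, ath, title):
    # anomaly scores of the test data
    anom_test = 1 - np.inner(dad.mu, dad.test_embeddings)
    labels = test_arr[:, 0].astype(float)
    plt.hist(anom_test[labels == 0], bins=50, alpha=0.5, label='normal')
    plt.hist(anom_test[labels == 1], bins=50, alpha=0.5, label='anomalous')
    plt.axvline(x=ath, color='r', label='threshold')
    plt.title(title)
    plt.xlabel('anomaly score')
    plt.legend()
    plt.show()

plot_test_anomaly(dad_use, ath_use_alpha, 'USE')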

We would now like to determine the threshold $a_\text{th}$ by some other method, but within an unsupervised setting there seems to be no further option. We therefore set the threshold in a supervised setting, assuming the normal/anomalous flags of the development data are known. Specifically, we take the following steps:

1. Split the anomaly scores of the development data into those of the normal data and those of the anomalous data.
2. Fit each of the two anomaly-score distributions with a chi-square distribution.
3. Take the intersection point of the two chi-square distributions as the anomaly-score threshold.

The figure below shows the result of this procedure for the anomaly scores computed by USE. The solid black and blue lines are the two chi-square distributions, and the dotted red line is the threshold.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/d05ab151-a0a6-b681-8769-9704e6c9c536.png", height=240>

This procedure for finding the threshold is implemented in the method decide_ath_by_labels of the class DirectionalAnomalyDetection, and is executed as follows.

anomaly_detection.ipynb


#-- ELMo
ath_elmo = dad_elmo.decide_ath_by_labels(x_ini=0.02)
# iteration:  5
# ath: 0.0312

#-- BERT
ath_bert = dad_bert.decide_ath_by_labels(x_ini=0.2)
# iteration:  6
# ath: 0.1740

#-- USE
ath_use = dad_use.decide_ath_by_labels(x_ini=0.8)
# iteration:  5
# ath: 0.7924

Predict the test data using the obtained threshold value and evaluate the accuracy.

anomaly_detection.ipynb


#Forecast
predict_elmo = dad_elmo.predict(ath_elmo)
predict_bert = dad_bert.predict(ath_bert)
predict_use = dad_use.predict(ath_use)

#Correct answer labels
answer = test_arr[:, 0].astype(float)

anomaly_detection.ipynb


#Function for accuracy evaluation
def calc_precision(answer, predict, title, export_fig=False):
  acc = accuracy_score(answer, predict)
  precision = precision_score(answer, predict)
  recall = recall_score(answer, predict)
  f1 = f1_score(answer, predict)
  cm = confusion_matrix(answer, predict)
  print("Accuracy: {:.3f}, Precision: {:.3f}, Recall: {:.3f}, \
        F1: {:.3f}".format(acc, precision, recall, f1))

  plt.rcParams["font.size"] = 18
  cmd = ConfusionMatrixDisplay(cm, display_labels=[0,1])
  cmd.plot(cmap=plt.cm.Blues, values_format='d')
  plt.title(title)
  if export_fig:
      plt.savefig("./cm_" + title + ".png", bbox_inches='tight')
  plt.show()
  return [acc, precision, recall, f1]

anomaly_detection.ipynb


#Accuracy evaluation
_ = calc_precision(answer, predict_elmo, title='ELMo', export_fig=True)
_ = calc_precision(answer, predict_bert, title='BERT', export_fig=True)
_ = calc_precision(answer, predict_use, title='USE', export_fig=True)

The results are as follows.

| Model | Accuracy | Precision | Recall | F1 |
|-------|----------|-----------|--------|-------|
| ELMo  | 0.916    | 0.910     | 0.923  | 0.917 |
| BERT  | 0.851    | 0.847     | 0.858  | 0.852 |
| USE   | 0.946    | 0.931     | 0.963  | 0.947 |

Confusion matrices:

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/ec194af5-3dc1-836f-0739-18e3dceb6529.png", height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/a539e4d5-161d-2c87-5fc0-5da16b9780ee.png", height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/70faf102-db47-c39f-4908-d8dc529cad3f.png", height=200>

With the thresholds determined in the supervised setting, the accuracy of every model improves greatly. Comparing the three models, USE is the most accurate on all metrics, while BERT is the least accurate.

Finally, for each model we plot the anomaly scores of the test data separately for normal and anomalous data.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/98cee787-b773-e0fe-b10c-09262fc9440c.png", height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/c57dbbdc-4b5e-06cf-d542-b6b35c5344e5.png", height=200><img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/fe357b32-518e-58b8-07eb-fd1e2c74bb70.png", height=200>

The dotted red vertical line is the threshold obtained using the correct labels. Although it was determined from development data containing only about 1% anomalous data, it does indeed sit at the intersection of the normal and anomalous distributions. For BERT, the two distributions overlap heavily, which is why its accuracy is the lowest.

In conclusion

Since the anomaly-score threshold could not be set properly in the unsupervised setting used in the previous article, we built the anomaly detection model with supervision. Even so, the correct labels were used only to determine the threshold, so the accuracy was ultimately determined by the quality of the encoder model, that is, by how well it separates the anomaly-score distributions of normal and anomalous data. The most accurate encoder model turned out to be USE. Perhaps tasks based on cosine similarity are, after all, what USE is best at.
