[PYTHON] Anomaly detection by autoencoder using keras [Implementation example for beginners]

What I did in this article

- Challenged myself to implement an autoencoder with Keras
- Implemented anomaly detection by unsupervised learning
- Evaluated the results using recall and precision

Introduction

Unsupervised learning is generally less accurate than supervised learning, but it offers many benefits in exchange. Specifically, unsupervised learning is useful for things like

- data whose patterns are not well understood
- data that changes over time
- unlabeled data

And so on.

Unsupervised learning learns **the structure behind the data** from the data itself. This lets you take advantage of the far more abundant unlabeled data, which may open the way to new applications.

So, this time, I will introduce an **anomaly detection method that uses unsupervised learning**. There are various approaches to anomaly detection; here I will walk through the code for **anomaly detection with an autoencoder**.

The data and code follow the book ["Hands-On Unsupervised Learning Using Python"](https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E5%AD%A6%E7%BF%92-%E2%80%95%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E5%8F%AF%E8%83%BD%E6%80%A7%E3%82%92%E5%BA%83%E3%81%92%E3%82%8B%E3%83%A9%E3%83%99%E3%83%AB%E3%81%AA%E3%81%97%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%88%A9%E7%94%A8-Ankur-Patel/dp/4873119103).

Data to handle

We use a dataset for credit card fraud detection. It appears to be the dataset originally published on Kaggle.

You can download the data from the following link: https://github.com/aapatel09/handson-unsupervised-learning/blob/master/datasets/credit_card_data/credit_card.csv

Library import

Since this follows the reference book, a few of these imports are not strictly necessary.

```python
'''Main'''
import numpy as np
import pandas as pd
import os, time, re
import pickle, gzip

'''Data Viz'''
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl

%matplotlib inline

'''Data Prep and Model Evaluation'''
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score

'''TensorFlow and Keras'''
import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Activation, Dense, Dropout
from keras.layers import BatchNormalization, Input, Lambda
from keras import regularizers
from keras.losses import mse, binary_crossentropy

sns.set("talk")
```

Download data

```python
data = pd.read_csv("credit_card.csv")
dataX = data.copy().drop(["Class","Time"],axis=1)
dataY = data["Class"].copy()

print("dataX shape:{},dataY shape:{}".format(dataX.shape,dataY.shape))
dataX.head()
```
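Alternatively, if you have not downloaded the file yet, the CSV can presumably be read straight from GitHub. The raw-file URL below is an assumption based on GitHub's usual raw.githubusercontent.com scheme, not something given in the original article; downloading the file manually and reading it locally works just as well.

```python
import pandas as pd

# Assumed raw URL derived from the GitHub page linked above (not from the original article)
RAW_URL = (
    "https://raw.githubusercontent.com/aapatel09/handson-unsupervised-learning/"
    "master/datasets/credit_card_data/credit_card.csv"
)

data = pd.read_csv(RAW_URL)  # or pd.read_csv("credit_card.csv") after a manual download
print(data.shape)
```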

The first rows of the dataframe should be displayed like this; there are actually 29 feature columns. (figure: head of dataX)

- dataX contains the records of 284,807 card transactions
- dataY records, for each of those 284,807 transactions, whether it was fraudulent

Note that you cannot tell from these anonymized features alone what kind of card usage each column represents.

Overview of anomaly detection method

Next, I will introduce how the anomalies are detected. In this data, roughly 0.2% of the transactions are fraudulent; in other words, the vast majority of the data represents normal usage.
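As a quick sanity check (not in the original article), you can confirm the class imbalance directly from dataY, assuming the data has been loaded as above:

```python
# Count fraudulent (Class == 1) vs. normal (Class == 0) transactions
print(dataY.value_counts())
print(dataY.value_counts(normalize=True))  # the fraud share should be on the order of 0.2%
```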

The working assumption is that the **structure of this fraudulent data differs from the structure of the normal data**. The idea is to reveal this difference in structure using an autoencoder.

So how can an autoencoder flush out anomalous data? An autoencoder, as shown below, **first compresses the data into a representation with fewer dimensions and then reconstructs it back to the original number of dimensions**. (figure: schematic of an autoencoder compressing and reconstructing the data)

The reconstructed data is usually close to the original data. For anomalous data, however, the reconstruction **deviates from the original more than it does for normal data**.

Whether a sample is anomalous is therefore judged from the error between the original data and the reconstructed data.

At first this did not quite click for me, but once you think about it, it is remarkably rational. One caveat: this method assumes that **anomalous samples are very rare compared to normal ones**. If there were a lot of anomalous data, the model could no longer learn to reconstruct normal data well, and the approach would break down.
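To make the idea concrete, here is a minimal sketch (not from the original article) of flagging anomalies by reconstruction error; the 99th-percentile cutoff is an arbitrary assumption for illustration only.

```python
import numpy as np

def flag_anomalies(original, reconstructed, percentile=99):
    """Flag samples whose squared reconstruction error is unusually large."""
    # Per-sample sum of squared differences between input and reconstruction
    errors = np.sum((np.asarray(original) - np.asarray(reconstructed)) ** 2, axis=1)
    threshold = np.percentile(errors, percentile)  # assumed cutoff; tune for your data
    return errors, errors > threshold
```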

Creation of training data and evaluation data

```python
#Scale conversion so that the average of all features is 0 and the standard deviation is 1.
featuresToScale = dataX.columns
sX = pp.StandardScaler(copy=True, with_mean=True, with_std=True)
dataX.loc[:,featuresToScale] = sX.fit_transform(dataX[featuresToScale])

#Divide into training data and test data
X_train, X_test, y_train, y_test = \
  train_test_split(dataX,dataY,test_size=0.33,random_state=2018,stratify=dataY)

X_train_AE = X_train.copy()
X_test_AE = X_test.copy()
```

Implementation of autoencoder

Next is how to implement the autoencoder.

Here we build a **two-layer autoencoder with linear activation functions**. The distinctive point is that **no labels are used: the target data is the same as the input data**.

```python
# Build an autoencoder with two layers and linear activation functions

model = Sequential()
model.add(Dense(units=27,activation="linear",input_dim=29)) # units sets how many dimensions the data is compressed into
model.add(Dense(units=29,activation="linear"))

model.compile(optimizer="adam",loss="mean_squared_error",metrics=["accuracy"])
num_epochs = 3
batch_size = 32

history = model.fit(x=X_train_AE,y=X_train_AE,
                    epochs=num_epochs,
                    batch_size=batch_size,
                    shuffle=True,
                    validation_data=(X_train_AE,X_train_AE),
                    verbose=1)
```

Creating the evaluation functions

At this point the autoencoder is complete. To evaluate it, we create two helpers:

- a function that calculates the error between the original data and the reconstructed data
- a function that plots the precision-recall curve and the ROC curve (AUC)

The first is the function that calculates the reconstruction error. It is just a sum of squared errors, so it is not particularly difficult.

```python
# Anomaly score function: computes the reconstruction error between the original
# feature matrix and the reconstructed feature matrix.
# The sum of squared errors is normalized to lie between 0 and 1:
# values close to 1 are anomalous, values close to 0 are normal.
def anomalyScores(originalDF,reduceDF):
  loss = np.sum((np.array(originalDF)-np.array(reduceDF))**2,axis=1)
  loss = pd.Series(data=loss,index=originalDF.index)
  loss = (loss-np.min(loss))/(np.max(loss)-np.min(loss))
  return loss
```

Next is the function that plots the precision-recall curve and the ROC curve.

```python
# Plot the precision-recall curve, the average precision, and the ROC curve (auROC)
def plotResults(trueLabels, anomalyScores,returnPreds=False):
  preds = pd.concat([trueLabels,anomalyScores],axis=1)
  preds.columns = ["trueLabel","anomalyScore"]

  # Calculate precision and recall at each threshold
  precision, recall, thresholds = precision_recall_curve(preds["trueLabel"],preds["anomalyScore"])
  average_precision = average_precision_score(preds["trueLabel"],preds["anomalyScore"])

  # Precision-recall curve
  plt.step(recall,precision,color="k",alpha=0.7,where="post")
  plt.fill_between(recall,precision,step="post",alpha=0.3,color="k")

  plt.xlabel("Recall")
  plt.ylabel("Precision")
  plt.ylim([0,1.05])
  plt.xlim([0,1.0])

  plt.title("Precision-Recall Curve:Average Precision={0:0.2f}".format(average_precision))

  fpr,tpr,thresholds = roc_curve(preds["trueLabel"],preds["anomalyScore"])
  areaUnderROC = auc(fpr,tpr)
 
  # ROC curve
  plt.figure()
  plt.plot(fpr,tpr,color="r",lw=2,label="ROC curve")
  plt.plot([0,1],[0,1],color="k",lw=2,linestyle="--")
  plt.xlabel("False positive Rate")
  plt.ylabel("True Postive Rate")
  plt.ylim([0,1.05])
  plt.xlim([0,1.0])

  plt.title("Receiver operating characteristic: Area under the curve = {0:0.2f}".format(areaUnderROC))
  plt.legend(loc="lower right")
  plt.show()

  if returnPreds:
    return preds
```

This function builds a dataframe holding the true labels and the anomaly scores. The key functions are **precision_recall_curve and roc_curve**.

**precision_recall_curve calculates the precision and recall as the threshold is varied from 1 down to 0**.

For example, when the threshold is 0, every score of 0 or above is flagged as anomalous; in other words, everything is flagged. In that case the recall is 1, while the precision drops to roughly the fraction of truly fraudulent samples, which is close to 0 here. It is a convenient function that computes these values for every threshold.
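Here is a tiny illustration (not from the original article) of what precision_recall_curve returns, using the toy example from the scikit-learn documentation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# Each entry corresponds to one candidate threshold taken from the scores;
# as the threshold rises, recall can only fall while precision tends to rise.
print(precision, recall, thresholds)
```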

** roc_curve calculates the False Positive Rate and the True Positive Rate for each threshold **.

For example, when the threshold is 0, every score of 0 or above is judged positive. Then the true positive rate is 1 because there are no false negatives, and the false positive rate is also 1 because there are no true negatives. It is a convenient function that computes these rates at each threshold.
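Analogously, a minimal roc_curve call on the same toy data (again, not from the original article) looks like this:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("auROC on the toy data:", auc(fpr, tpr))  # area under the ROC curve
```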

Evaluation results of anomaly detection by the autoencoder

Now for the evaluation results of the autoencoder implemented above.
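The original article does not show the code that produces the plots below, but with the functions defined above the evaluation can presumably be run roughly like this (a sketch, assuming the model and helper functions from the previous sections):

```python
# Reconstruct the test data and score it (sketch; variable names follow the sections above)
predictions = model.predict(X_test_AE, verbose=1)
anomaly_scores = anomalyScores(X_test_AE, pd.DataFrame(predictions, index=X_test_AE.index))

# Plot the precision-recall curve and the ROC curve against the true labels
preds = plotResults(y_test, anomaly_scores, returnPreds=True)
```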

First, the precision-recall curve. (figure: precision-recall curve) At a recall of 75% on the horizontal axis, the precision on the vertical axis is about 60%. In other words, **we can capture 75% of the fraudulent transactions, and about 60% of the transactions flagged that way are actually fraudulent**.

Next is the ROC curve. The area under it (auROC) measures how well the true positive rate can be raised while keeping the false positive rate low. Here it was 0.92. (figure: ROC curve)

These results show that **a reasonable degree of classification is possible without ever training on the labels**. To further improve accuracy, options worth considering include (see the sketch below):

- introducing dropout
- changing the activation function
- changing the number of nodes in the compressed layer
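As one illustration of those options, here is a sketch (not from the original article) of a slightly deeper autoencoder that adds dropout, a nonlinear activation, and a different bottleneck size; the specific values are arbitrary starting points, not tuned settings.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Sketch of a variant architecture: 29 -> 20 (ReLU) -> dropout -> 29 (linear)
model_v2 = Sequential()
model_v2.add(Dense(units=20, activation="relu", input_dim=29))  # smaller, nonlinear bottleneck
model_v2.add(Dropout(0.05))                                     # light dropout for regularization
model_v2.add(Dense(units=29, activation="linear"))              # reconstruct the 29 features

model_v2.compile(optimizer="adam", loss="mean_squared_error")
# model_v2.fit(X_train_AE, X_train_AE, epochs=10, batch_size=32, shuffle=True)
```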

In closing

The strengths of unsupervised learning are that it can learn without large amounts of labeled data and that it adapts flexibly to changes in the data.

Personally, I find it fascinating that machine learning can capture the structure behind the data. How exactly is it learning...? It is a world that is hard to imagine and put into words.

Besides classification, the book also introduces methods for actually generating data (restricted Boltzmann machines, deep learning, GANs, and so on).

I hope it will be helpful for everyone's learning.
