[PYTHON] Introduction to Anomaly Detection 1 Basics

Aidemy 2020/11/10

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read the previous summary article. Thank you! This is the first post in the introduction to anomaly detection series. Nice to meet you.

What to learn this time
- About anomaly detection
- About the Hotelling method and Mahalanobis distance
- About the naive Bayes method

About anomaly detection

What is anomaly detection?

- __Anomaly detection__ is, as the name suggests, __detecting abnormal data__. It is widely used, for example, to detect abnormalities in patients in the medical field and to detect system failures.
- In this unit, we aim at __"understanding the theory of anomaly detection"__ and __"implementing a simple anomaly detection system"__.

Abnormal pattern

- First, let's look at the patterns of "abnormality" that should be detected.

Outliers

- The first is __"outliers"__. They have appeared several times before: outliers are __"values that lie far away from the rest of the data"__. This is easy to understand with an illustration.

・ Figure ![Screenshot 2020-11-02 17.27.04.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/2755124d-bcd2-2987-4a35-ccbd77acea87.png)

- The problem of detecting outliers is called __"outlier detection"__. Details are covered in Chapter 2.

Change points

- The second is __"change points"__. A change point is __"a point at which the values start to behave differently from before"__. As in the figure below, if the values rise sharply from a certain point, that point is the change point.

・ Figure (Screenshot 2020-11-02 17.31.12.png)

- The problem of detecting change points is called __"change (point) detection"__ or __"abnormal part detection"__. Details are covered in Chapter 3.

Indicators for anomaly detection

- The __"degree of abnormality"__ (anomaly score) is a quantitative indication of how abnormal a value is. Basically, the higher the degree of abnormality, the more abnormal the value.
- The point that forms the boundary between abnormal and normal is called the __"threshold"__.
- Everything done in anomaly detection from here on follows the flow __"define the degree of abnormality → determine the threshold"__. When determining the threshold, the __"probability that normal data is mistakenly regarded as abnormal"__ is often fixed in advance; this probability is called the __"false alarm rate"__.

- The __"normal sample accuracy"__ is __"(number of actually normal samples that are correctly judged normal) / (number of samples that are actually normal)"__, and __"1 − normal sample accuracy"__ is the __false alarm rate__. In other words, the false alarm rate is __"the proportion of actually normal samples that were judged abnormal"__.
- These values are also used to __evaluate the accuracy__ of a detector.

・ Code to calculate the false alarm rate (Screenshot 2020-11-02 18.29.37.png)
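Since the original code is only available as a screenshot, here is a minimal sketch of the calculation defined above. The label arrays and the function name `false_alarm_rate` are hypothetical and chosen only for illustration (0 = normal, 1 = abnormal).

```python
import numpy as np

# Hypothetical labels for illustration: 0 = normal, 1 = abnormal.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # actual state
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])  # detector output


def false_alarm_rate(y_true, y_pred):
    """Fraction of actually-normal samples that were judged abnormal."""
    normal = (y_true == 0)
    # Normal-sample accuracy: actually-normal samples also judged normal.
    normal_accuracy = np.mean(y_pred[normal] == 0)
    return 1.0 - normal_accuracy


far = false_alarm_rate(y_true, y_pred)
print(far)  # 2 of the 8 normal samples were flagged, so 0.25
```

Only the samples that are actually normal enter the calculation; the two abnormal samples do not affect the false alarm rate.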

Hotelling method

What is the Hotelling method?

- The __Hotelling method__ is one of the __outlier detection methods__. As described above, it defines a degree of abnormality and a threshold.
- It can be applied effectively only when the data satisfies two conditions: __"it is generated from a single normal distribution"__ and __"it contains almost no abnormal values"__.

- The flow of the Hotelling method is as follows.
  - ① Set the __false alarm rate__ yourself and calculate the __threshold__ from it.
  - ② Calculate the __mean__ and __covariance matrix__ of the data regarded as normal.
  - ③ Calculate the __degree of abnormality__ of the test data, and judge it __abnormal if it exceeds the threshold__.

- That is the flow; note that it treats __the data as almost entirely normal__. Also, the Hotelling method can be regarded as __unsupervised learning__.

Mahalanobis distance

- Step ③ of the Hotelling method calculates the degree of abnormality, and the index used for this is the __"Mahalanobis distance"__. It is the __"distance between a data point and the mean of the whole data"__.
- Ordinary distances are measured with the __"Euclidean distance"__, but it is not suitable as a degree of abnormality because __"its value changes greatly with the size of the variance (scale)"__ and __"it cannot reflect the correlation between variables"__.
- The Mahalanobis distance, by contrast, takes the variance into account by __normalizing with the inverse covariance matrix__. Concretely, it strengthens the influence of directions with small variance and weakens the influence of directions with large variance.
- This normalization also corrects for the __correlation between features__.

- The code is as follows (details are given later, since the same calculation is done in ②).

(Screenshot 2020-11-02 21.01.39.png)
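To make the contrast with the Euclidean distance concrete, the following sketch compares two points that are equally far from the mean in the Euclidean sense but very different in the Mahalanobis sense. The data, the random seed, and the two test points are assumptions made up for this illustration; they are not taken from the original screenshot.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical 2-D data with strongly correlated features (illustration only).
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.8], [1.8, 1.0]], size=500)

mean = np.mean(data, axis=0)
cov_inv = np.linalg.pinv(np.cov(data.T))

# Two points at the same Euclidean distance from the mean:
# one lies along the correlation direction, the other against it.
points = {
    "along": mean + np.array([2.0, 1.0]),
    "against": mean + np.array([2.0, -1.0]),
}

results = {}
for name, x in points.items():
    d_euclid = distance.euclidean(x, mean)
    d_mahal = distance.mahalanobis(x, mean, cov_inv)
    # The same quantity written out from the definition.
    d_manual = np.sqrt((x - mean) @ cov_inv @ (x - mean))
    results[name] = (d_euclid, d_mahal, d_manual)
    print(name, d_euclid, d_mahal, d_manual)
```

The point that goes against the correlation of the data gets a much larger Mahalanobis distance, even though its Euclidean distance is identical.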

①. Set the false alarm rate yourself and calculate the threshold value accordingly.

- From here, we actually implement the Hotelling method. __Detecting outliers with the Hotelling method__ is called the __"T-squared method"__ (Hotelling's T²).
- The first step is to __calculate the threshold__, and the __"false alarm rate"__ needed for that calculation must be __set by yourself__. If the false alarm rate is set high, more abnormalities are detected; if it is set low, more normal data is retained. There is thus a trade-off between __removing abnormal data and keeping normal data__, so it is important to choose the value to suit the case. Values such as 0.05 and 0.01 are commonly used.

- Once the false alarm rate is set, and provided the amount of data is sufficient, the threshold is determined using the __"χ² (chi-squared) distribution"__.
- For example, with a false alarm rate of 0.05, a value is judged __"abnormal because it is a rare value that occurs with probability less than 0.05 (5%) under normal conditions"__.

- This is implemented with __"st.chi2.ppf()"__. Pass __"1 − false alarm rate"__ as the first argument and the __number of dimensions (number of variables)__ as the second argument.

・ Code (the degree of abnormality a is calculated with the Mahalanobis distance, but that is omitted here) (Screenshot 2020-11-02 21.35.27.png)
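As a minimal sketch of step ①, the threshold can be obtained as follows. The variable names `false_alarm` and `n_features` are hypothetical. One point to keep in mind, stated here as an assumption about how the threshold is meant to be used: the chi-squared threshold applies to the squared Mahalanobis distance.

```python
import scipy.stats as st

false_alarm = 0.05   # false alarm rate, chosen by hand
n_features = 2       # number of variables (dimensions)

# Value that a chi-squared variable with n_features degrees of freedom
# exceeds with probability false_alarm.
threshold = st.chi2.ppf(1 - false_alarm, n_features)
print(threshold)  # about 5.99
```

Lowering `false_alarm` to 0.01 raises the threshold, so fewer points are flagged as abnormal.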

②. Calculate the mean value and covariance matrix of normal values

- To calculate the degree of abnormality in ③, the Mahalanobis distance described above is needed, so this step obtains the __"mean"__ and __"covariance matrix"__ required for it.
- The mean is calculated with __np.mean(data, axis=0)__ (axis=0 computes it for each column).
- The covariance matrix of the data is calculated with __np.cov(data.T)__.

・ Code ![Screenshot 2020-11-02 21.51.42.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/42a59e29-060e-420e-f9bf-a79bd4ffdcdf.png)
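The shapes involved in step ② are easy to get wrong, so here is a small sketch on made-up data (4 samples, 2 variables). The array `data` is hypothetical; the two NumPy calls are the ones named above.

```python
import numpy as np

# Hypothetical normal data: 4 samples (rows) x 2 variables (columns).
data = np.array([[1.0, 2.0],
                 [2.0, 4.1],
                 [3.0, 5.9],
                 [4.0, 8.0]])

mean = np.mean(data, axis=0)   # one mean per column -> shape (2,)
cov = np.cov(data.T)           # np.cov expects variables in rows -> shape (2, 2)

print(mean)
print(cov)
```

The transpose is needed because `np.cov` treats each row as a variable by default, while the data here stores one sample per row.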

③. Calculate the degree of abnormality of the test data, and judge that it is abnormal if it exceeds the threshold value.

- Finally, the Mahalanobis distance is calculated by passing the __data__ (x), the __mean__ (mean), and the __inverse covariance matrix__ (__np.linalg.pinv(cov)__) to __distance.mahalanobis()__.
- Compare this result with the __threshold__ calculated in ① to detect abnormalities.

・ Code ![Screenshot 2020-11-02 22.07.29.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/4fec9fec-1ca8-52cb-0fa2-01002f1a9729.png)

Practice of the Hotelling method

- Perform steps ① to ③ above. The code is as follows.
- (Supplement) To calculate the degree of abnormality of each data point, extract the elements of X_test one by one with a __for statement__. In the "comparison between the threshold and the degree of abnormality", which was not done in the previous section, a point is regarded as "abnormal" if its degree of abnormality is larger than the threshold and "normal" otherwise.

(Screenshot 2020-11-02 23.20.46.png)

・ Results (red is outlier (abnormal), blue is normal) (Screenshot 2020-11-02 23.21.25.png)
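For readers who cannot see the screenshots, steps ① to ③ can be sketched end to end as below. This is not the original code: the data is randomly generated for illustration, two obviously distant points are appended to the test set, and the labels are printed instead of plotted. Squaring the Mahalanobis distance before comparing it with the chi-squared threshold is an assumption of this sketch.

```python
import numpy as np
import scipy.stats as st
from scipy.spatial import distance

rng = np.random.default_rng(42)

# Hypothetical data: training data assumed almost entirely normal,
# test data with two far-away points appended at the end.
cov_true = [[1.0, 0.6], [0.6, 1.0]]
X_train = rng.multivariate_normal([0, 0], cov_true, size=300)
X_test = np.vstack([
    rng.multivariate_normal([0, 0], cov_true, size=20),
    [[6.0, -6.0], [-7.0, 7.0]],
])

# (1) Threshold from the false alarm rate.
false_alarm = 0.05
threshold = st.chi2.ppf(1 - false_alarm, X_train.shape[1])

# (2) Mean and inverse covariance matrix of the normal data.
mean = np.mean(X_train, axis=0)
cov_inv = np.linalg.pinv(np.cov(X_train.T))

# (3) Degree of abnormality of each test sample, compared with the threshold.
labels = []
for x in X_test:
    # Assumption: the score is the squared Mahalanobis distance.
    a = distance.mahalanobis(x, mean, cov_inv) ** 2
    labels.append("abnormal" if a > threshold else "normal")

print(labels)
```

With a false alarm rate of 0.05, a few of the 20 genuinely normal test points may also be flagged; that is the trade-off described in step ①.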

Naive Bayes method

What is the naive Bayes method?

- Anomaly detection methods such as the Hotelling method have the drawback that __"as the number of variables (dimensions) increases, the amount of computation grows and the method becomes too complicated"__. The __"naive Bayes method"__ addresses this by __reducing the multivariable problem to one-variable (one-dimensional) problems__.
- The basic idea of the naive Bayes method is __"assuming there is no correlation between variables, the probability that several events all occur is the product of their individual probabilities"__. For example, when a coin is tossed twice, the probability of getting heads both times is "1/2 * 1/2". Applying this, the probability that data is abnormal or normal can be obtained as the product of the (weighted) probabilities for each variable.
- That is, in the naive Bayes method the __degree of abnormality can be calculated as the inner product of the data and the weights__ (with each dimension handled on its own). Also, unlike in the Hotelling method, the __threshold__ is not calculated from a formula but is obtained by __optimizing on validation data__.
- The naive Bayes method can be used when __"each variable of the data is an integer of 0 or more"__.
- The flow is as follows.
  - ① Calculate the weights from the training data using the naive Bayes method
  - ② Optimize the __threshold__ using the evaluation data
  - ③ Calculate the degree of abnormality and compare it with the threshold

- For the flow above, Chapter 2 and later look at a problem to which the naive Bayes method is applied: __"detecting abnormal values in documents"__.

Prior knowledge of the naive Bayes method

- Not only in anomaly detection but in data processing generally, there is the idea that __"when data is regarded as a vector, taking its inner product with an appropriate vector yields a number that represents a characteristic of the data"__. This "appropriate vector" is called the __"weight vector"__.
- In the naive Bayes method too, the degree of abnormality in ③ can be found easily by first obtaining the weight vector and then taking its inner product with the data. This is why __① calculates the weight vector__.
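The link between "a product of probabilities" and "an inner product with a weight vector" may not be obvious, so the following sketch checks it numerically. The per-word probabilities `w0` and `w1` and the count vector `x` are made-up values for illustration.

```python
import numpy as np

# Hypothetical per-word probabilities under "normal" (w0) and "abnormal" (w1).
w0 = np.array([0.5, 0.3, 0.2])
w1 = np.array([0.1, 0.3, 0.6])

# Word counts for one document.
x = np.array([1, 0, 3])

# Independence: the likelihood of the document is a product over words
# (up to a constant factor that cancels in the ratio below).
p_normal = np.prod(w0 ** x)
p_abnormal = np.prod(w1 ** x)

# Taking the log of the ratio turns the product into an inner product.
score_from_products = np.log(p_abnormal / p_normal)
score_from_dot = np.dot(x, np.log(w1 / w0))

print(score_from_products, score_from_dot)
```

Both values agree, which is why the log-ratio of the two probabilities can serve as the weight vector and the degree of abnormality can be computed with a single inner product.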

- The "detection of abnormal values in documents" done here is something like spam filtering of emails. Each email (document) is morphologically analyzed to count the words useful for identification, such as nouns, and these counts are passed together with a label saying whether the email is spam in order to create the weights. (That is, this is __supervised learning__.)
- After that, a newly passed document is morphologically analyzed in the same way, and its degree of abnormality is calculated by multiplying the occurrence counts by the weights and summing.

- The data used for detecting abnormal values in documents expresses word frequencies in a __"bag-of-words representation"__. Concretely, see the code below. It means that "hoge" and "foo" did not appear, and that "bar", "po", and "do" appeared 3, 4, and 1 times, respectively.

(Screenshot 2020-11-03 14.19.17.png)
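The screenshot is not reproduced here, so the following is a sketch of what such a bag-of-words vector might look like. The vocabulary order and the token list are assumptions; only the counts (0, 0, 3, 4, 1) come from the description above.

```python
from collections import Counter

import numpy as np

# Assumed vocabulary order; the original screenshot is not available.
vocabulary = ["hoge", "foo", "bar", "po", "do"]

# A hypothetical tokenized document (for example, the output of morphological analysis).
tokens = ["bar", "po", "bar", "po", "do", "po", "bar", "po"]

# Count each vocabulary word; words that never appear get 0.
counts = Counter(tokens)
x = np.array([counts[word] for word in vocabulary])

print(dict(zip(vocabulary, x.tolist())))
```

Every document is mapped to a vector of the same length, one non-negative integer per vocabulary word, which matches the condition for using the naive Bayes method.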

① Calculate the weight from the training data by the naive Bayes method

- With the training data "X_train" split into normal data (X0) and abnormal data (X1), and with "w0" the weight of the normal data and "w1" the weight of the abnormal data, the __weight__ is calculated as __"np.log(w1 / w0)"__.
- It is therefore necessary to find __w0 and w1__. w0 is __"(total number of appearances of each word in all normal data) ÷ (total number of words appearing in all normal data)"__, and w1 is the same for the abnormal data. However, so that no weight becomes 0 when the logarithm is taken with np.log, __"alpha = 1" is added__. This is called __"geta"__ (padding the counts, commonly known as additive smoothing).
- The total number of appearances of each word is calculated with __"np.sum(X0, axis=0)"__ (for normal data), and the total number of words with __"np.sum(X0)"__.

・ Code ![Screenshot 2020-11-03 15.00.37.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/aaae2407-1169-6915-8c1d-5e475662ef9e.png)

・ Result (only part) ![Screenshot 2020-11-03 15.01.02.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/d12ccefe-4b34-d290-260d-60cc82c2744e.png)
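A minimal sketch of the weight calculation in ① follows. The training data and labels are made up, and the exact way alpha enters the denominator is an assumption of this sketch (here `n_words * alpha` is added so that w0 and w1 each still sum to 1); the original code may differ.

```python
import numpy as np

# Hypothetical bag-of-words training data: rows are documents, columns are words.
X_train = np.array([[2, 1, 0, 0],
                    [3, 0, 1, 0],
                    [0, 0, 2, 3],
                    [0, 1, 1, 4]])
y_train = np.array([0, 0, 1, 1])  # 0 = normal, 1 = abnormal

X0 = X_train[y_train == 0]  # normal documents
X1 = X_train[y_train == 1]  # abnormal documents

alpha = 1                     # smoothing constant so that no frequency is zero
n_words = X_train.shape[1]    # vocabulary size

# Assumption: alpha is added to each word count and n_words * alpha to the total.
w0 = (np.sum(X0, axis=0) + alpha) / (np.sum(X0) + n_words * alpha)
w1 = (np.sum(X1, axis=0) + alpha) / (np.sum(X1) + n_words * alpha)

weight = np.log(w1 / w0)
print(weight)
```

Words that appear mostly in abnormal documents get a positive weight and words that appear mostly in normal documents get a negative one, so a larger inner product means "more abnormal".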

② Optimize the threshold from the evaluation data

- The weights in ① were calculated from the training data (X_train), whereas the threshold is optimized using the evaluation data __"X_valid"__.
- The threshold is optimized with __"metrics.roc_curve()"__. Pass the labels of the evaluation data, __"y_valid"__, as the first argument and the __degree of abnormality of the evaluation data__ as the second argument.
- How the degree of abnormality is obtained is covered in the next section; since it is the inner product of the data and the weights, as described above, it can be calculated with __"np.dot()"__.
- "metrics.roc_curve()" returns three arrays, the __"false positive rate", "true positive rate", and "threshold candidates"__, so store them in the variables __"fpr", "tpr", and "thr_arr"__.
- From these three, the threshold is obtained as __"thr_arr[(tpr - fpr).argmax()]"__, that is, the candidate at which the true positive rate exceeds the false positive rate by the most.

・ Code ![Screenshot 2020-11-03 16.14.13.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/83c470ec-7235-3cd6-a7f0-97573bd33d21.png)
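The threshold optimization in ② can be sketched as follows. The weight vector and the evaluation data are hypothetical values chosen so the result is easy to check by hand; in practice the weights would come from step ①.

```python
import numpy as np
from sklearn import metrics

# Hypothetical weight vector and evaluation data, for illustration only.
weight = np.array([-2.0, -0.5, 0.5, 2.0])
X_valid = np.array([[3, 1, 0, 0],
                    [2, 0, 1, 0],
                    [1, 1, 1, 1],
                    [0, 1, 2, 2],
                    [0, 0, 1, 3]])
y_valid = np.array([0, 0, 0, 1, 1])  # 0 = normal, 1 = abnormal

# Degree of abnormality of each evaluation document: inner product with the weights.
a_valid = np.dot(X_valid, weight)

fpr, tpr, thr_arr = metrics.roc_curve(y_valid, a_valid)

# Pick the candidate where the true positive rate most exceeds
# the false positive rate.
threshold = thr_arr[(tpr - fpr).argmax()]
print(threshold)  # 4.5 for this data
```

Here the two abnormal documents score 4.5 and 6.5 while the normal ones score 0 or less, so the chosen threshold separates them perfectly. With real data the classes overlap and the threshold is a compromise.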

③ Calculate the degree of abnormality and compare it with the threshold value

- Finally, as mentioned in the previous section, the degree of abnormality is calculated with __"np.dot()"__ and compared with the threshold. (Code omitted)
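Since the code for this step is omitted, here is a short sketch of what it could look like. The weight vector, threshold, and test documents are placeholder values standing in for the results of ① and ②.

```python
import numpy as np

# Hypothetical values standing in for the results of steps (1) and (2).
weight = np.array([-2.0, -0.5, 0.5, 2.0])
threshold = 4.5

# New documents in bag-of-words form.
X_test = np.array([[2, 1, 0, 0],
                   [0, 0, 2, 3]])

a_test = np.dot(X_test, weight)      # degree of abnormality per document
# roc_curve counts a score equal to the threshold as positive, so >= is used.
is_abnormal = a_test >= threshold

print(a_test, is_abnormal)
```

The first document scores -4.5 and is judged normal; the second scores 7.0 and is judged abnormal.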

Summary

- Abnormal patterns include __"outliers" and "change points"__. Detecting them is __anomaly detection__.
- One outlier detection method is the __"Hotelling method"__. It first determines the __threshold__ that forms the boundary between abnormal and normal, and then calculates the degree of abnormality from the __mean and (inverse) covariance matrix__ of the data. Comparing the degree of abnormality with the threshold decides whether a point is abnormal or normal.
- Because this method becomes costly as the number of dimensions grows, the __"naive Bayes method"__ makes anomaly detection feasible by treating each variable as a one-dimensional problem.
- In the naive Bayes method, the __weight vector__ is calculated first, the degree of abnormality is obtained as the inner product of the weight vector and the data, and the result is compared with a threshold to decide whether a point is abnormal or normal.

That's all for this time. Thank you for reading to the end.
