[PYTHON] Time series data anomaly detection for beginners

Introduction

This article was posted as Day 6 of the Cisco Advent Calendar 2019, written by members of Cisco Systems LLC.

Seven years have passed since the SuperVision team at the University of Toronto, Canada won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 with a model based on deep learning. Since then, during what is called the third AI boom, a great deal of modeling has been done with machine learning and deep learning, and it is now common knowledge that AI is built on these techniques.

However, according to a survey released by the Ministry of Internal Affairs and Communications in 2017, only 14.1% of companies had actually introduced AI solutions, and 22.8% were considering introducing them. Even including companies at the consideration stage, the total is 36.9%. (Source: AI / IoT introduction status and schedule, Ministry of Internal Affairs and Communications)

Many open source tools and services for machine learning are now available, which makes it easy to try machine learning on a small scale. With that in mind, over the last two years I have written examples of how to use machine learning easily.
2017 article: I tried to machine learn the location information obtained using API
2018 article: Collecting machine learning teacher data using collaboration tools

Therefore, this time I would like to introduce an example that makes it easy to build machine learning models from time-series data, which is probably the most commonly collected kind of data. In particular, this article focuses on how machine learning is applied in the field of anomaly detection.

Time series data

Time-series data refers to any data observed over time, where the order of observation is significant. Among Cisco products, Syslog data is one kind of time series data, and there is a wide variety of other examples, such as the Flow data of Stealthwatch and the connection destination data of Cloudlock.

When applying time-series data to machine learning / deep learning, the following flow is often taken.
(image: 4.png)

In the similarity-based method, the accuracy varies greatly depending on how the features are extracted, how the window size of the time series data is set (that is, how the series is divided into segments where features are likely to appear), and which machine learning method is used for modeling afterwards.

In the model-based method, a model used in econometric time series analysis is first fitted to the data. Such models include hidden Markov models, ARMA models, VAR models, SARIMA models, and so on. They treat state transitions as changes over time and represent how the data evolves. Each of these models has parameters, so the fitted parameters are then used as input to the machine learning method.
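As a rough illustration of the model-based route, the sketch below fits a plain autoregressive (AR) model, a simpler relative of the ARMA family mentioned above, to one window by ordinary least squares and returns the coefficients as a feature vector. The function name `ar_features`, the order `p=2`, and the sample series are assumptions made for this example and do not come from the original article.

```python
import numpy as np


def ar_features(window, p=2):
    """Fit x[t] = a1*x[t-1] + ... + ap*x[t-p] by least squares and return [a1, ..., ap]."""
    x = np.asarray(window, dtype=float)
    # Column k holds the value observed (k + 1) steps before each target value.
    design = np.column_stack([x[p - k - 1:len(x) - k - 1] for k in range(p)])
    target = x[p:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs


# Hypothetical window generated by a known AR(2) rule (illustration only).
series = [1.0, 0.5]
for _ in range(18):
    series.append(0.6 * series[-1] - 0.2 * series[-2])

# The fitted coefficients can serve as the feature vector for a later classifier.
print(ar_features(series, p=2))
```

In practice a dedicated time series library would usually be used for ARMA or SARIMA fitting; the least-squares fit here only shows the idea of turning a window into model parameters.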

Figure 1: Cisco Web Security Appliance Time Series Data
Figure 2: Cisco Cloudlock Time Series Data

This time, we aim to "find anomalous data by comparing with normal data" for time series data obtained as numerical data.

Anomaly detection

I recognize that anomaly detection is a very difficult field, full of mathematical formulas from statistics, probability theory, and optimization theory, but it can be applied in many areas:
- Discovering signs of abnormality in factories
- Discovering problems such as malware in the security field
- Discovering health-related abnormalities using human body measurement data

Being able to detect anomalies in these fields has the advantage that problems can be dealt with more quickly than if humans had to notice them.

Many machine learning methods have been proposed for anomaly detection; this article introduces some of them.

Personally, I see two important ways of thinking in the field of anomaly detection: emphasizing accuracy (especially reducing false negatives) and emphasizing the speed of detection. The choice matters because it also affects how the system is actually operated.

1. Emphasizing the accuracy of anomaly detection

The idea of emphasizing accuracy is especially common in the field of failure detection. This is because we want to avoid the case where equipment is actually broken but is predicted to be "not broken". In machine learning terms, the idea is to increase recall.

(Reference) What is recall? When considering accuracy, consider the following confusion matrix.
Figure 3: Confusion matrix

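Recall is the metric expressed by the following formula.
Recall = TP / (TP + FN)
Here TP (true positives) is the number of actually abnormal cases that the model judged abnormal, and FN (false negatives) is the number of actually abnormal cases that the model judged normal. In other words, recall is an index of "how much the model can judge as abnormal when it is actually abnormal".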
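The recall calculation can be sketched in a few lines of Python. The helper names `confusion_counts` and `recall`, the label convention (1 = abnormal, 0 = normal), and the sample labels are assumptions made for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 = abnormal and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): the share of real anomalies that were detected."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no actual anomalies")
    return tp / (tp + fn)


# Hypothetical labels: 4 real failures, of which the model catches 3.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
print(recall(actual, predicted))  # 0.75
```

Note that the false positive in the example (a normal case flagged as abnormal) does not lower recall; it would lower precision instead.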

In reality, it is difficult to judge whether equipment such as factory machinery has failed without any human involvement. However, if there are many false negatives, trust in the AI application will be low, operations will stay the same as before its introduction, and the administrator's burden will not be reduced. To avoid this situation, AI applications are expected to keep false negatives low and thereby reduce the burden of failure detection.

2. Emphasizing the speed of detection

When using machine learning / deep learning, you must also consider the computational cost. No matter how accurate the model is, if detection takes a long time, there is little point in deploying an AI application.

Take a security AI application as an example. No matter how reliably unusual traffic is detected, detection on the following day is meaningless if important information has already been stolen. In other words, unless you can spend a lot of money on infrastructure such as servers, you need to consider the computational cost of the model and emphasize the speed of detection.

Machine learning methods with low computational cost include the naive Bayes classifier and the k-means / k-medoids methods. I will not explain them in detail here, and many other methods are also used for anomaly detection.

Application example

Here, I would like to consider an application example. The data is assumed to be two-dimensional time series data, such as the number of flows in Stealthwatch. I use assumed data rather than a specific data set in order to keep the example as general as possible.
Figure 4: Cisco Stealthwatch Time Series Data

From the above time series data, segments of the window size are extracted as patterns and compared with the pattern in the normal state.
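Cutting a series into windows can be sketched as follows. The function name `sliding_windows`, the `step` parameter, and the sample flow counts are assumptions for this illustration; the article does not prescribe a particular windowing routine.

```python
import numpy as np


def sliding_windows(values, window_size, step=1):
    """Cut a 1-D series into equal-length windows, moving `step` samples each time."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    values = np.asarray(values, dtype=float)
    starts = range(0, len(values) - window_size + 1, step)
    return np.array([values[s:s + window_size] for s in starts])


# Hypothetical flow counts sampled at a fixed interval.
flows = [10, 12, 11, 50, 52, 13, 12, 11]
windows = sliding_windows(flows, window_size=4, step=2)
print(windows.shape)  # (3, 4)
```

Each row of `windows` is one pattern that can then be compared with a pattern taken from the normal state.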

First, regarding pattern extraction, the point to consider this time is that the number of flows rarely becomes 0: traffic can be assumed to be flowing at all times, especially over a window size of several minutes to several hours. If you know the number will drop to 0, you can split the series at those points, but when it does not, you have to think about how to extract the patterns.

If the window size is small, the time of day and the duration at which the same pattern appears are expected to differ from case to case. For example, suppose we know that employees watch YouTube during lunch breaks, so the number of flows rises sharply between 12:00 and 13:00 and falls again after 13:00. Even then, the time of the peak varies from day to day (we do not know whether the maximum will come at 12:31 or at 12:36).

In this way, when the "shapes are similar" regardless of differences along the time axis, the degree of similarity can be calculated with a method called Dynamic Time Warping (DTW). This method works even when the data lengths are not uniform. In other words, two patterns can be compared even if their time widths differ.

On the other hand, if deviation along the time axis is itself important, use the Euclidean distance. An example is the difference in the number of flows between day and night when the window size is large.
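The difference between the two distances can be seen on a small example. The sketch below uses a compact DTW with an absolute-difference cost; the function names `dtw_distance` and `euclidean_distance` and the sample values are assumptions for illustration only.

```python
import numpy as np


def dtw_distance(a, b):
    """Compact DTW with absolute-difference cost (illustration only)."""
    d = np.full((len(a) + 1, len(b) + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(d[-1, -1])


def euclidean_distance(a, b):
    """Point-by-point distance; both series must have the same length."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


# Hypothetical flow counts: the same lunchtime peak, one sample later on day 2.
day1 = [0, 0, 5, 9, 5, 0, 0, 0]
day2 = [0, 0, 0, 5, 9, 5, 0, 0]

print(dtw_distance(day1, day2))        # 0.0: the shapes match after warping
print(euclidean_distance(day1, day2))  # about 9.06: the time shift is penalized
```

DTW treats the shifted peak as the same shape, while the Euclidean distance reports a large difference, which is the behavior you want only when the timing itself matters.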

Now, let's actually implement DTW in Python using dynamic programming. The distance between two points may be measured either as an absolute value or as a Euclidean distance, so the choice is made with the method argument. The first and second arguments are two pieces of time-series data cut out by window size. To compare against several days of time series data, call this function multiple times.

import numpy as np


def dtw(wave_x, wave_y, method="abs"):
    """Return the DTW distance, the accumulated-cost table, and the point-wise costs."""
    # d[i, j] is the smallest accumulated cost of aligning wave_x[:i] with wave_y[:j].
    d = np.full((len(wave_x) + 1, len(wave_y) + 1), np.inf)
    d[0, 0] = 0
    matrix = []  # point-wise costs, one row per sample of wave_x
    for i in range(1, d.shape[0]):
        row = []
        for j in range(1, d.shape[1]):
            diff = np.subtract(wave_x[i-1], wave_y[j-1])
            if method == "euclid":
                cost = np.sqrt(np.sum(diff ** 2))
            else:
                cost = np.sum(np.abs(diff))
            row.append(cost)
            # Extend the cheapest of the three allowed predecessor alignments.
            d[i, j] = cost + min(d[i-1, j], d[i, j-1], d[i-1, j-1])
        matrix.append(row)
    return d[-1][-1], d, matrix

By calculating this DTW for every pair of windows, the distance matrix between multiple time series data can be obtained. Using this distance matrix, consider classification with the k-medoids method. This time, the number of clusters is set to 2 because the aim is to classify two patterns, "normal" and "abnormal".

self.n_cluster = 2
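The distance matrix mentioned above could be assembled roughly as follows. The helper `pairwise_distance_matrix` and the stand-in distance `abs_distance` are hypothetical names introduced for this sketch so that it runs on its own; any symmetric distance function can be passed in.

```python
import numpy as np


def pairwise_distance_matrix(windows, distance):
    """Return the symmetric matrix D where D[i, j] = distance(windows[i], windows[j])."""
    n = len(windows)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is assumed symmetric, so each pair is computed only once.
            D[i, j] = D[j, i] = distance(windows[i], windows[j])
    return D


def abs_distance(a, b):
    """Stand-in distance for this example: sum of absolute differences."""
    return float(np.sum(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))


# Hypothetical windows: two similar patterns and one that is very different.
windows = [[1, 2, 3], [1, 2, 4], [9, 9, 9]]
D = pairwise_distance_matrix(windows, abs_distance)
print(D)
```

To use the DTW function from this article instead of the stand-in, `distance` could be `lambda x, y: dtw(x, y)[0]`, since `dtw` returns the distance as the first element of its result.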

The implementation of the k-medoids method is shown below. The k-medoids method is not implemented in scikit-learn, so it is implemented here by hand. Pass the distance matrix as D_matrix.

import numpy as np
import pandas as pd


class KMedoids():
    def __init__(self, max_iter=300):
        self.n_cluster = 2  # two clusters: "normal" and "abnormal"
        self.max_iter = max_iter

    def fit_predict(self, D_matrix):
        m, n = D_matrix.shape
        # Pick the initial medoids at random and assign each sample to the nearest one.
        ini_medoids = np.random.choice(range(m), self.n_cluster, replace=False)
        tmp_D = D_matrix[:, ini_medoids]

        labels = np.argmin(tmp_D, axis=1)

        results = pd.DataFrame([range(m), labels]).T
        results.columns = ['id', 'label']

        col_names = ['x_' + str(i + 1) for i in range(m)]
        results = pd.concat([results, pd.DataFrame(D_matrix, columns=col_names)], axis=1)

        old_medoids = ini_medoids
        new_medoids = []

        loop = 0
        # Repeat until the medoids stop changing or max_iter is reached.
        while ((len(set(old_medoids).intersection(set(new_medoids))) != self.n_cluster)
               and (loop < self.max_iter)):
            if loop > 0:
                old_medoids = new_medoids.copy()
                new_medoids = []
            for i in range(self.n_cluster):
                tmp = results[results['label'] == i].copy()
                # Sum of distances from each member to all members of the same cluster.
                tmp['distance'] = np.sum(tmp.loc[:, ['x_' + str(idx + 1) for idx in tmp['id']]].values, axis=1)
                tmp = tmp.reset_index(drop=True)
                new_medoids.append(tmp.loc[tmp['distance'].idxmin(), 'id'])

            new_medoids = sorted(new_medoids)
            tmp_D = D_matrix[:, new_medoids]

            # Reassign every sample to its nearest new medoid.
            clustering_labels = np.argmin(tmp_D, axis=1)
            results['label'] = clustering_labels
            loop += 1

        results = results.loc[:, ['id', 'label']]
        results['flag_medoid'] = 0

        for medoid in new_medoids:
            results.loc[results['id'] == medoid, 'flag_medoid'] = 1
        tmp_D = pd.DataFrame(tmp_D, columns=['medoid_distance' + str(i) for i in range(self.n_cluster)])
        results = pd.concat([results, tmp_D], axis=1)

        self.results = results
        self.cluster_centers_ = new_medoids
        return results['label'].values

With the above, the data can be classified into two classes, normal and abnormal. The details of the k-medoids method are omitted here. Its characteristics are not very different from those of the k-means method, but because it computes medoids (actual data points) rather than means, it is more resistant to outliers. In addition, classification is possible as long as a distance matrix is available, so the method is widely applicable. The medoid is found as follows.

(image: image.png)
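In words, the medoid of a cluster is the member whose total distance to the other members of the same cluster is smallest. A minimal sketch of that step is shown below; the function name `find_medoid` and the small distance matrix are assumptions for illustration, and the original figure may express the same idea in different notation.

```python
import numpy as np


def find_medoid(D_matrix, member_ids):
    """Return the member whose summed distance to the other members is smallest."""
    member_ids = np.asarray(member_ids)
    # Keep only the rows and columns that belong to the cluster members.
    sub = np.asarray(D_matrix)[np.ix_(member_ids, member_ids)]
    return int(member_ids[np.argmin(sub.sum(axis=1))])


# Hypothetical distance matrix for three samples.
D = np.array([
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
])
print(find_medoid(D, [0, 1, 2]))  # 1
```

Because the medoid is always an actual data point, a single extreme sample shifts it far less than it would shift a mean.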

Conclusion

This article described approaches to anomaly detection and how to apply them, and then showed a code example for anomaly detection using two-dimensional time series data.

Since the k-medoids method can classify anything for which a distance matrix is available, even character strings can be classified by defining a distance between them. Distances for character strings include the Jaro-Winkler distance and the Levenshtein distance, so please look them up as well.
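As one example of a string distance, the Levenshtein distance can be computed with dynamic programming. The sketch below is a generic textbook-style implementation, not code from the original article, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Computing this distance for every pair of strings gives a distance matrix that can be passed to the k-medoids method in the same way as the DTW distances.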

We hope that you will refer to this article and try anomaly detection using the data you have at hand.

Disclaimer

The opinions expressed on this site and in the corresponding comments are the personal opinions of the contributor and not the opinions of Cisco. The content of this site is provided for informational purposes only and is not intended as an endorsement or representation by Cisco or any other party. By posting on this website, each contributor agrees that they are solely responsible for the content of all information uploaded by posting, linking, or otherwise, and that Cisco is released from any liability regarding the use of this website.
