[Python] [Curse of dimensionality] If the number of sensors goes to ∞, can anomalies still be detected?

Since the IoT boom, there has been a push to attach more and more sensors to machines for predictive maintenance.

I have even heard of cases where 100 sensors are attached to a single machine.

So the question is: is simply increasing the number of sensors the right thing to do? In other words, **does adding extra sensors affect anomaly detection performance?**

In this article, we focus on the curse of dimensionality and ask: what happens to anomaly detection performance as we keep adding extra sensors?


Conclusion first

To state the conclusion up front: if the extra sensors only ever emit a constant zero signal, anomaly detection performance barely changes. In practice, however, sensor signals contain noise, so adding more sensors eventually degrades anomaly detection performance through the curse of dimensionality.

Assumed scenario

Noise

Sensor readings almost always contain noise. Even a highly accurate sensor picks up a small amount of noise. There are countermeasures such as applying a low-pass (or high-pass) filter to remove the noise, but such preprocessing is outside the scope of this article.

In this article, we assume a scenario that uses **raw sensor data containing noise**.

What is the curse of dimensionality?

The curse of dimensionality is the phenomenon where, as the number of dimensions of the data increases, most of the volume comes to lie near the surface. The problem for machine learning is that the difference between the distance to the nearest point and the distance to the farthest point almost disappears, which makes it hard to distinguish points by distance. See the following article for details.

About the curse of dimensionality
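As a quick illustration (a minimal sketch I added, not taken from the original article), this distance concentration can be seen with a few lines of NumPy: as the dimension grows, the ratio between the farthest and nearest distances from a query point to random points approaches 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ratio of farthest to nearest distance from one query point to 1000 random
# points, for increasing dimensionality. The ratio shrinking toward 1 is the
# distance-concentration effect described above.
for dim in [2, 10, 100, 1000]:
    points = rng.random((1000, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(dim, dists.max() / dists.min())
```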

Impact on supervised learning

Personally, I think supervised learning is not much affected by the curse of dimensionality, because **dimensionality reduction and similar steps can be incorporated explicitly.** To put it bluntly, even if extra sensor information is present, you can simply drop the unnecessary features and keep whichever features give the highest accuracy.
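For example (a hypothetical sketch of my own, not from the article): when labels are available, a feature selection step such as scikit-learn's SelectKBest can filter out uninformative sensor columns before the classifier is trained.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic labeled data: 5 informative features, the remaining 95 are noise.
X, y = make_classification(n_samples=500, n_features=100, n_informative=5,
                           n_redundant=0, random_state=0)

# With labels, the noise sensors can be dropped explicitly before training.
model = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
model.fit(X, y)
```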

Impact on unsupervised learning

In unsupervised settings such as anomaly detection, however, you basically have no abnormal data, or at best only a small amount of it on hand. If you select features using that small amount of abnormal data as your only reference, there is a risk of **discarding the features you really need.** Carelessly reducing the dimensionality in an unsupervised way can therefore degrade anomaly detection performance.

On the other hand, is it really fine to feed unnecessary sensor information into the detector as-is? Won't anomaly detection performance suffer? In other words, doesn't stuffing in unnecessary sensor information increase the number of dimensions and, through the curse of dimensionality, make it harder to separate normal from abnormal? To answer this, we run an experiment using dummy data.

Experiment

As stated at the beginning, we experiment under the settings above (raw sensor data containing noise). The following two anomaly detection methods are used: the MT (Mahalanobis-Taguchi) method and Isolation Forest.

I will skip a detailed introduction. The MT method fits a normal distribution to the normal data and judges whether a point is abnormal by its Mahalanobis distance; the larger the Mahalanobis distance, the higher the degree of abnormality. Isolation Forest is a decision-tree-based anomaly detection method; the original paper shows that it remains effective even on data with 500 or more dimensions.
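As a rough sketch of how the two detectors can be set up with scikit-learn (my own minimal version with made-up dummy data; the actual notebook is linked below):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # stand-in for normal training data
X_test = np.array([[0.1, 0.0], [4.0, -4.0]])   # one normal-ish point, one abnormal point

# MT method: fit mean/covariance to the normal data and score by Mahalanobis distance.
cov = EmpiricalCovariance().fit(X_train)
md = np.sqrt(cov.mahalanobis(X_test))          # larger = more abnormal

# Isolation Forest: tree-based detector; sign-invert the score so larger = more abnormal.
iforest = IsolationForest(random_state=0).fit(X_train)
if_score = -iforest.score_samples(X_test)

print(md, if_score)
```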

The whole code is available [here](https://github.com/shinmura0/Number-of-Sensor/blob/master/Infinity_sensor.ipynb.ipynb).

Results of the MT method

First, we generate data for two working sensors ($ x_1, x_2 $) using random numbers.
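A minimal sketch (my own, not taken from the notebook; variable names and parameters are hypothetical) of how such correlated dummy sensors and test points could be generated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two correlated "working" sensors: x2 roughly follows x1 plus a little noise.
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.8 * x1 + rng.normal(0.0, 0.3, n)
X_train = np.column_stack([x1, x2])            # green points: training (normal) data

normal_point = np.array([[0.5, 0.4]])          # purple: a normal test point
abnormal_point = np.array([[1.5, -1.5]])       # red: an abnormal test point (breaks the correlation)
```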

[Figure: left, scatter of $ x_1 $ vs $ x_2 $ (green: training data, purple: normal, red: abnormal); right, MT-method anomaly scores (MD) with equal-probability ellipses in light blue.]

The left figure above shows that $ x_1 $ and $ x_2 $ are correlated. The green dots are the training data. It may help to think of $ x_1 $ as, say, temperature and $ x_2 $ as pressure.

The purple dots are normal data and the red dots are abnormal data.

When the MT method is applied in the $ x_1, x_2 $ space, there is a clear difference between the normal and abnormal anomaly scores (MD = Mahalanobis distance), as shown in the right figure above. The larger the Mahalanobis distance, the higher the degree of abnormality. Incidentally, the light-blue lines are equal-probability ellipses, i.e. contours along which the Mahalanobis distance is the same.
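For reference (my addition, using the standard definition rather than anything specific to the notebook), the Mahalanobis distance of a point $ x $ given the mean $ \mu $ and covariance matrix $ \Sigma $ estimated from the training data is

$$
MD(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}
$$

so the equal-probability ellipses above are simply the curves along which this value is constant.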

When the number of dimensions is changed from 2 to 3

Increase the number of dimensions by one ($ x_3 $).

[Figure: the correlated sensors $ x_1 $ and $ x_2 $ (left and middle) and the added noise-only sensor $ x_3 $ (right).]

We added one extra sensor ($ x_3 $), shown on the right above. Think of $ x_3 $ as, for example, a brightness sensor that happens to be attached. Whereas $ x_1 $ and $ x_2 $ (left and middle figures) are correlated, meaningful data, $ x_3 $ has no correlation with them and is nothing but noise.
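Continuing the earlier data-generation sketch (again my own hypothetical code, not the notebook's), the extra sensor is just an independent random column appended to the data:

```python
# x3: a sensor that carries no information about the machine, only noise.
x3 = rng.normal(0.0, 1.0, n)
X_train_3d = np.column_stack([x1, x2, x3])

# The test points also get an (independent) noise reading for x3.
normal_point_3d = np.append(normal_point, rng.normal(0.0, 1.0, (1, 1)), axis=1)
abnormal_point_3d = np.append(abnormal_point, rng.normal(0.0, 1.0, (1, 1)), axis=1)
```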

The $ x_1, x_3 $ space is illustrated below.

[Figure: scatter of $ x_1 $ vs $ x_3 $ (green: training data, purple: normal, red: abnormal).]

Looking at this figure alone, the gap between the normal and abnormal data is not that large, and depending on how the noise happens to fall, normal data can easily look like an outlier. This is exactly the factor that makes it harder to distinguish abnormal from normal.

Applying the MT method over the full $ x_1, x_2, x_3 $ space gives the following anomaly scores.

[Figure: MT-method anomaly scores (MD) for the normal and abnormal points in the $ x_1, x_2, x_3 $ space.]

The gap is smaller than in the two-dimensional case, but the abnormal data still receives the larger score.

When the number of dimensions is changed from 3 to 100

The result of continuing to add noise-only sensors in the same way, up to 98 of them ($ x_3 $ through $ x_{100} $), is as follows.

[Figure: number of dimensions (horizontal axis) vs. anomaly score MD (vertical axis) for the normal and abnormal points; the curves cross at around 20 dimensions.]

The horizontal axis is the number of dimensions and the vertical axis is the anomaly score (MD = Mahalanobis distance). As you can see, once the number of dimensions reaches about 20, the normal and abnormal scores are reversed. In other words, the detector makes false detections.

Since the experiment uses random numbers, the exact results vary from run to run, but in every run normal and abnormal were detected correctly as long as the number of dimensions stayed small.
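For reference, the sweep over the number of dimensions can be sketched roughly like this (my own simplified version; the linked notebook contains the actual code, and the variable names here are hypothetical):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(0)
n = 100

# Correlated working sensors plus one normal and one abnormal test point, as before.
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.8 * x1 + rng.normal(0.0, 0.3, n)
base_train = np.column_stack([x1, x2])
base_test = np.array([[0.5, 0.4],       # normal
                      [1.5, -1.5]])     # abnormal

for n_noise in range(0, 99):            # 0 to 98 extra noise sensors (2 to 100 dimensions)
    train = np.hstack([base_train, rng.normal(size=(n, n_noise))])
    test = np.hstack([base_test, rng.normal(size=(2, n_noise))])
    cov = EmpiricalCovariance().fit(train)
    md_normal, md_abnormal = np.sqrt(cov.mahalanobis(test))
    print(train.shape[1], md_normal, md_abnormal)
```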

Isolation Forest results

When the number of dimensions is changed from 2 to 100

The result is similar to that of the MT method.

[Figure: number of dimensions vs. Isolation Forest anomaly score for the normal and abnormal points; again the curves cross at around 20 dimensions.]

Here too, once the number of dimensions reaches about 20, normal and abnormal are reversed and the detector misfires. Note that Isolation Forest is run with scikit-learn, and its anomaly score has been sign-inverted for readability (in the figure above, a higher anomaly score means a higher degree of abnormality).
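Concretely, the inversion is just a sign flip on scikit-learn's `score_samples` output (a small sketch of mine, not verbatim from the notebook):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))

clf = IsolationForest(random_state=0).fit(X_train)

# scikit-learn's score_samples is higher for normal points, so flip the sign
# so that a higher value means "more abnormal", as in the figure above.
anomaly_score = -clf.score_samples(np.array([[0.1, 0.0], [5.0, -5.0]]))
print(anomaly_score)
```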

To avoid the curse of dimensionality

As we have seen, feeding in too much unnecessary sensor information increases the number of dimensions, and the curse of dimensionality then makes it hard to distinguish normal from abnormal. On the other hand, carelessly dropping sensor information risks degrading anomaly detection performance. A way out of this dilemma is touched on in the summary below.

Summary

Next time, I will introduce a method for identifying the cause of a detected anomaly. Using that technique, it becomes possible to narrow down which sensors are actually effective and **cut down the number of extra sensors.**
