Since the IOT epidemic, efforts have been made to increase the number of sensors in order to predict and maintain machines. It is being done.

Rumor has it that there have been cases where 100 sensors are attached to one machine.

So the question is, is it justice to increase the number of sensors? In other words, does ** adding extra sensors affect ** anomaly detection performance? When That is to say.

In this article, we focus on the curse of dimensionality and say, "Anomaly detection when increasing the number of extra sensors. What will happen to the performance? "

The entire code is placed in here.
This is the presentation material of Python Data Analysis Study Group # 17.

From the conclusion

Extra sensor information can reduce anomaly detection performance
In order to avoid it, it is necessary to take measures such as reducing extra sensors.

Assumed scene

Only two sensors are effective for anomaly detection
It also has an extra sensor and emits almost zero signal, but ** noise is included (important) **
What happens to the abnormal score if you increase the number of extra sensors?

From the conclusion, if the extra sensor keeps emitting zero signal, the abnormality detection performance will be It can be said that there is almost no change. However, because it contains noise, it becomes a curse of dimensionality. If you increase the number of sensors under the influence of, the abnormality detection performance will deteriorate.

noise

Sensors almost always contain noise. Even with a sensor with good accuracy It contains a small amount of noise. Low (high) pass filter to remove noise There are measures to install such as, but such processing is out of the scope of this article.

In this article, we assume a scene that uses ** raw sensor data containing noise **.

What is the curse of dimensionality?

As the number of dimensions of the data increases, the surface volume occupies most of the total volume. It is a phenomenon that becomes like this. The problem with machine learning is the difference in distance between the nearest point and the farthest point. The point is that it almost disappears and it becomes difficult to distinguish by distance. See the following articles for details.

About the curse of dimensionality

Impact on supervised learning

Personally, in supervised learning, ** dimensionality reduction etc. can be explicitly incorporated, so the curse of dimensionality I don't think it will be affected. ** Extreme talk, even with extra sensor information It can be said that it is better to reduce unnecessary features and use the features that give the highest accuracy.

Impact on unsupervised learning

However, in unsupervised learning such as anomaly detection, there is basically no anomaly data, or It is possible that you have only a small amount of abnormal data at hand. And a small amount of abnormal data For reference, if you reduce the features, there is a risk that you will ** reduce the features you really need. Therefore, if you easily reduce the dimension by unsupervised learning, the abnormality detection performance may deteriorate. There is.

However, is it okay to put unnecessary sensor information into the detector as it is? Anomaly detection performance Isn't it deteriorated? The question also comes up. In other words, unnecessary sensor information The number of dimensions increases because it is inserted, and it is difficult to distinguish between normal and abnormal due to the curse of dimensionality. Isn't it? The question is. Therefore, we will conduct an experiment using dummy data.

Experiment

Only two sensors are effective for anomaly detection
It also has an extra sensor and emits almost zero signal, but it contains noise.
What happens to the abnormal score if you increase the number of extra sensors? Observe

As mentioned at the beginning, experiment with the above settings. The following two methods are used as anomaly detection methods.

MT method
Isolation Forest

I will omit a detailed introduction, but MT method applies normal data to the normal distribution and uses the Mahalanobis distance. Determine if it is abnormal. The larger the Mahalanobis distance, the higher the degree of abnormality. Isolation Forest is a decision tree-based anomaly detection method. Thesis of the head family has 500 dimensions It has been shown to be valid for more data.

I put the whole code here [https://github.com/shinmura0/Number-of-Sensor/blob/master/Infinity_sensor.ipynb.ipynb).

Results of MT method

First, generate a working sensor ($ x_1, x_2 $) using random numbers.

From the figure on the left above, you can see that there is a correlation between $ x_1 and x_2 $. The green dots are the training data. If you replace $ x_1 $ with temperature, $ x_2 $ with pressure, etc. It may be easy to understand.

The purple dots are normal data and the red dots are abnormal data.

When the MT method is applied in the $ x_1, x_2 $ space, it is normal / abnormal as shown in the right figure above. There is a clear difference in the abnormal score (MD = Mahalanobis distance). The larger the Mahalanobis distance, the higher the degree of abnormality. By the way, the light blue line is An equal probability ellipse is a line that represents areas with the same Mahalanobis distance.

When the number of dimensions is changed from 2 to 3

Increase the number of dimensions by one ($ x_3 $).

Added one extra sensor information like the one on the right ($ x_3 $) above. $ x_3 $ is, for example, a brightness sensor You may attach it. $ x_1 and x_2 $ (left and middle figures) were correlated and meaningful data, $ x_3 $ has no correlation and is just noisy data.

The $ x_1, x_3 $ space is illustrated below.

Looking only at this figure, the difference between normal / abnormal data is not so large, and it is a way of riding noise. Therefore, it is likely that normal data will be outliers. And that's it This is a factor that makes it difficult to distinguish between abnormal and normal.

$ x_1, x_2, x_3 $ If you apply the MT method over the entire space, the anomaly score will be as follows.

The difference is smaller than when the number of dimensions is 2, but the abnormal data still has a higher score. It's getting bigger.

When the number of dimensions is changed from 3 to 100

The result of continuing to increase $ x_3 $ to 98 as before is as follows.

The horizontal axis is the number of dimensions, and the vertical axis is the abnormal score (MD = Mahalanobis distance). As you can see, when the number of dimensions is 20, the normal and abnormal scores are reversed. In other words, it is a false positive.

Since it is an experiment using random numbers, the results will change from experiment to experiment, but all the results are as long as the number of dimensions is small. Normal and abnormal could be detected correctly.

Isolation Forest results

When the number of dimensions is changed from 2 to 100

The result is similar to the MT method.

After all, when the number of dimensions is 20, normal and abnormal are reversed, and false detection is performed. In addition, Isolation Forest uses scikit-learn, but the abnormal score is for the sake of clarity. The numbers are inverted. (In the figure above, the higher the abnormality score, the higher the degree of abnormality.)

To avoid the curse of dimensionality

As a result, if you enter too much unnecessary sensor information, the number of dimensions will increase, and due to the curse of dimensionality It has become difficult to distinguish between normal and abnormal. However, the sensor information is unnecessarily If you drop it, there is a risk that the abnormality detection performance will deteriorate. Solve this dilemma The methods are as follows.

Cosine similarity ~~ Generally, angle-based methods such as ** cosine similarity ** are less susceptible to the curse of dimensionality. It is said. ~~ (← I think I saw it somewhere, but I can't remember the source, so far I'm erasing. If anyone knows any references or articles, please let me know. )
Subdivide sensor information For example, if you have 100 sensor information, instead of pushing them into one detector Here is an idea ** to make a detector by dividing the sensor information into two pieces. This makes it a curse of dimensionality You can mitigate the impact. Assuming that we made detectors by brute force, we decided to make $ 100C_2 = 4950 $ detectors. Become. I am concerned about the processing speed of 4950 detectors, but the MT method can process at high speed. Isolation Forest is heavy processing, so real-time processing is difficult, but if you are offline I think it is a usable level. However, since we are only looking at two relationships, there are three or more relationships. In some cases, you may miss an anomaly.
Reduce extra sensors This is the easiest and clearest. ** If you have abnormal data, which sensor information is effective? You can squeeze it. ** This allows you to remove extra sensors and avoid higher dimensions. I can do it. However, as mentioned at the beginning, if the amount of abnormal data is small, the sensor that is really necessary If there is a risk of reduction and there is no abnormal data, while collecting abnormal data The disadvantage is that the detector needs to be upgraded. The MT method uses the SN ratio You can narrow down the sensors that are working. In Next article, not only the MT method but also other methods I will show you how to narrow down the effective sensors.

Summary

If you keep increasing the extra sensor information, the anomaly detection performance may decrease.
To avoid that, it is necessary to devise such as using ~~ cosine similarity, ~~ subdividing the sensor information, etc.
If the number of extra sensors can be reduced, deterioration of anomaly detection performance can be avoided, costs can be reduced, and two birds with one stone.

Next time will introduce a method to find the cause of anomaly detection. Using this technique, it is possible to narrow down the effective sensors and ** reduce the number of extra sensors. ** **

[PYTHON] [Curse of dimensionality] If the number of sensors is changed to ∞, can anomaly be detected?