Use R density ratio estimation package densratio from Python

This time, the article is a combination of the two themes. One is to call the R package from Python and use it, and the other is to try to detect anomalies using the density ratio estimation R package densratio. is. R is very attractive because there are so many packages of statistical methods published (more than 8000), and some of them are not in Python. However, if you are familiar with Python, you may want to preprocess data and draw graphs in Python, which you are familiar with. (It's about me: sweat_smile :) For those people, here's how to use the R package from Python.

The full text of the Python source code can be found on GitHub here, so please use it as appropriate.

</ i> Environment

The environment I tried is as follows. I'm trying it on Mac and Anaconda. If you have Anaconda installed, no special preparation is required. The content of this article is basically verified with Jupyter Notebook (iPython Notebook).

Python 3.5.1 |Anaconda custom (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin

IPython 5.0.0 

</ i> Call R package from Python

Preparation

First of all, it is necessary that R is installed. Then start R (or RStudio) once and install the dens ratio package as shown below. Used for density ratio estimation. Once the installation is complete, it is OK to exit R.

r


install.packages("densratio")

Access R from Python

Install a library called rpy2 to access R from Python.

bash


pip install rpy2

We will access R using this rpy2. First of all, import various things. Now you can use R objects and convert between Pandas dataframes and R dataframes.

Python


from rpy2.robjects.packages import importr
import rpy2.robjects as robjects

# import pandas.rpy.common as com   # [depricated]
from rpy2.robjects import pandas2ri
pandas2ri.activate()

</ i> Try the density ratio estimation package dens ratio

I will try density ratio estimation using the package densratio made by boss @hoxo_m and anomaly detection using it.

I will read the R dens ratio package in Python immediately, but I can import it by writing only one line as shown below.

Python


densratio = importr("densratio")

Try with one-dimensional data

Density ratio estimation is a method of directly estimating the density ratio when the distribution of input variables called covariate shift is different between training data and test data, instead of estimating each distribution and then taking the ratio. (A covariate is an input variable)

Training data $ \ mathcal {D} = \ {\ boldsymbol {x} ^ {(1)}, \ cdots, \ boldsymbol {x} ^ {(N)} \} $ is only normal data (if abnormal) Even if there is data, it is very small), and test data $ \ mathcal {D}'= \ {\ boldsymbol {x} ^ {'(1)}, \ cdots, \ boldsymbol {x} ^ {' (N')} \} $ contains a certain number of anomalous data, and this anomalous data is detected from the density ratio.

If the training data distribution is $ p_ {tr} (\ boldsymbol {x}) $ and the test data distribution is $ p_ {te} (\ boldsymbol {x}) $, what is the density ratio?

w(\boldsymbol{x}) = {p_{tr}(\boldsymbol{x}) \over p_{te}(\boldsymbol{x})}

It is a value expressed as. Estimate $ \ hat {w} (\ boldsymbol {x}) $ for this $ w (\ boldsymbol {x}) $

\hat{w}(\boldsymbol{x}) = \sum_{l=1}^{N} \alpha_l \varphi_l (\boldsymbol{x})

And it is represented by a linear combination of the basis functions $ \ {\ varphi_l (\ boldsymbol {x}) \} _ {l = 1} ^ {N} $. Here, $ \ varphi_l (\ boldsymbol {x}) $ is called a basis function, and this time the RBF kernel is used for this basis function.

\varphi_l (\boldsymbol{x}) = \exp \left( { \| \boldsymbol{x} - \boldsymbol{x}^{l}\|^2 \over 2\sigma^2} \right)

Is used. Since $ \ boldsymbol {x} ^ {l} $ represents the $ l $ th training data, it looks like a normal distribution with the normalization constants centered on the training data omitted.

The following is an example, assuming that there are 5 training data [0.5, 4.5, 5.0, 7.0, 9.0], from which α = [.1, .25, .35, .2, .1] The weighted kernels are added together, and the bottom row shows how it is represented as the estimated density ratio $ \ hat {w} (\ boldsymbol {x}) $.

img012.png

So this density ratio estimation is a matter of determining the kernel mix ratio $ \ boldsymbol {\ alpha} $. $ \ Sigma $, which indicates the width of the kernel, is a hyperparameter, which is determined by cross-validation.

Please refer to the books, pages, and papers mentioned above for why this works and how to decide $ \ boldsymbol {\ alpha} $.

Generation of artificial data

The usual outlier problem is to calculate the anomaly degree individually and compare it with the threshold value assuming that one new data is obtained, but this method is characterized by the fact that the test data is multiple samples. .. First, this data is artificially generated using random numbers.

Also, since densratio is an R package, use pandas2ri.py2ri to transfer data from the Pandas world to the R world.

Python


rd.seed(71)
n_data1 = 3000
n_data2 = 500
m = [1, 1.2]
sd = [1/8, 1/4]

x = rd.normal(loc=m[0], scale=sd[0], size=n_data1)
y = rd.normal(loc=m[1], scale=sd[1], size=n_data2)

#Convert Python dataframes to R dataframes!
r_x = pandas2ri.py2ri(pd.DataFrame(x))
r_y = pandas2ri.py2ri(pd.DataFrame(y))

Abnormal data has a slightly higher average and a larger variance than normal data.

img001.png

Perform density ratio estimation

Calculate the density ratio using dens ratio for this data. Two types of density ratio estimation methods can be used in this package, KLIEP, which estimates the density ratio using the Kullback-Leibler information amount, and uLSIF, which uses the unconstrained least squares density ratio estimation method. This time, we will use uLSIF, which is also the default and can perform high-speed calculations. Just put the r_x, r_y converted to an R dataframe into this function. It's convenient: grin: (This time $ \ sigma $ was set to 0.2 without cross-validation)

Python


result = densratio.densratio(r_x, r_y, "uLSIF",0.2, "auto") 

A Warning message is displayed during this process, but it is a log of the process of determining hyperparameters by cross-validation instead of Warning, but it is displayed as Warning in the rpy2 specifications.

out


//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: ################## Start uLSIF ##################

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Searching optimal sigma and lambda...

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.200, lambda = 0.001, score = 11.113

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.200, lambda = 0.003, score = 1.026

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.200, lambda = 0.010, score = -0.654

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.200, lambda = 0.032, score = -0.931

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.200, lambda = 0.100, score = -1.009

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.200, lambda = 0.316, score = -1.029

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Found optimal sigma = 0.200, lambda = 0.316.

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Optimizing alpha...

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: End.

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: ################## Finished uLSIF ###############

  warnings.warn(x, RRuntimeWarning)

Extract and execute compute_density_ratio, which is a function that extracts the density ratio from the result, on the Python side. First, get the estimated density ratio w_hat to the normal specimen. Also, if the density ratio is $ \ hat {w} $, the degree of anomaly is minus its logarithm.

a(\boldsymbol{x}^{'(n)}) = -\ln \hat{w}(\boldsymbol{x}^{'(n)})

Since it is expressed by, this conversion is performed and calculated.

In addition, I would like to make an abnormality judgment using the 99% percentile of the distribution of this training data as the threshold of abnormality.

Python


#Calculate and extract the density ratio
compute_density_ratio = get_method(result, "compute_density_ratio")
w_hat = compute_density_ratio(r_x)

#Convert density ratio to anomaly
anom_score = -np.log(w_hat)
thresh = np.percentile(anom_score, 99)    #99 of the density ratio of normal specimens%Set the percentile as a threshold to judge abnormalities

Below is a plot of these w_hat and ʻanom_score`.

img002.png

Next, calculate and draw the density ratio of the test data and the histogram of the degree of anomaly.

Python


#Compute from the result_density_Take out the ratio and execute it. Get the estimated density ratio to the test data.
compute_density_ratio = get_method(result, "compute_density_ratio")
w_hat = compute_density_ratio(r_y)
#Convert density ratio to anomaly
anom_score = -np.log(w_hat)

You can see that there are more test data that exceed the abnormal threshold.

img003.png

Let's compare the original data with the degree of anomaly, count the number of data that exceeds the threshold of the degree of anomaly, and look at the accumulated data. (However, by subtracting the percentile set to the threshold value, the average will be 0 at normal times.) If you think of this data as time series data, the first one is almost normal and does not exceed the threshold value very much. , You can see that the tendency of the latter half data has changed and the data with a high degree of abnormality is suddenly increasing.

img004.png

</ i> For 2D input variables

Now let's try the case where the input variable is two-dimensional. Data is @ oshokawa's slide https://speakerdeck.com/oshokawa/mi-du-bi-tui-ding-niyoruyi-chang-jian-zhi I created it with the same specifications as.

Python


#Generation of training data
rd.seed(71)

cov_n = [[ 10,  0],
         [  0, 10]]
M = 2
m_n_1 = (10, 10)
m_n_2 = (20, 20)
n_data = 500

X_norm = np.empty((2*n_data, M))
X_norm[:n_data] = st.multivariate_normal.rvs(mean=m_n_1, cov=cov_n, size=n_data)
X_norm[n_data:] = st.multivariate_normal.rvs(mean=m_n_2, cov=cov_n, size=n_data)
N_norm = 2*n_data

#Test data generation
cov_n_1 = [[ 5,  0],
           [  0, 5]]
cov_n_2 = [[ .1,  0],
           [  0,.1]]

m_n_1 = (12.5, 12.5)
m_n_2 = (17.5, 17.5)
m_n_3 = (15, 15)
m_n_4 = (17.5, 12.5)
m_n_5 = (12.5, 17.5)

n_data_a = 50
N_anom = 5*n_data_a
X_anomaly = np.empty((n_data_a*5, M))
for i, m, c in zip(range(5), [m_n_1,m_n_2,m_n_3,m_n_4,m_n_5], [cov_n_1,cov_n_1,cov_n_2,cov_n_1,cov_n_1]):
    X_anomaly[n_data_a*i:n_data_a*(i+1)] = st.multivariate_normal.rvs(mean=m, cov=c, size=n_data_a)

X = np.r_[X_norm, X_anomaly]

The data looks like this. A machine has a low-speed rotation mode and a high-speed rotation mode, and the state changes in the middle, but it is made with the image of data that the movement is suspicious toward the end.

img005.png

img007.png

As with 1D data, it is converted to an R data frame and then subjected to dens ratio.

Python


#Convert to R data frame
r_x = pandas2ri.py2ri(pd.DataFrame(X_norm))
r_y = pandas2ri.py2ri(pd.DataFrame(X_anomaly))
r_xall = pandas2ri.py2ri(pd.DataFrame(X))

#Perform density ratio estimation
result = densratio.densratio(r_x, r_y) 

out


//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: ################## Start uLSIF ##################

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Searching optimal sigma and lambda...

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.001, lambda = 0.001, score = 0.000

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.001, lambda = 3.162, score = -0.000

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.032, lambda = 0.001, score = -0.000

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.100, lambda = 0.001, score = -0.002

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 0.316, lambda = 0.001, score = -0.352

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning:   sigma = 1.000, lambda = 0.010, score = -2.466

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Found optimal sigma = 1.000, lambda = 0.010.

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Optimizing alpha...

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: End.

  warnings.warn(x, RRuntimeWarning)
//anaconda/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: ################## Finished uLSIF ###############

  warnings.warn(x, RRuntimeWarning)

The flow of taking out the density ratio, calculating the degree of abnormality, and setting the threshold value from the training data is the same this time as well!

Python


#Get the estimated density ratio
compute_density_ratio = get_method(result, "compute_density_ratio")
w_hat = np.asanyarray(compute_density_ratio(r_xall))

#Abnormality threshold
anom_percentile = 5 #Below the estimated density ratio 5%The following is abnormal

#Set a threshold value that makes abnormalities below the specified percentile in the training data
anom_score = w_hat[:len(X_norm)]
thresh = np.percentile(anom_score, anom_percentile)

Let's plot the density ratio. img008.png

Convert the density ratio of the training data to the degree of anomaly and make a graph. Similarly, the abnormality judgment is performed using the specified percentile of training data (95% point this time) as the threshold value. img009.png

Based on the calculated threshold value, count the degree of abnormality of all data and the number of data that exceeds the threshold value to see the tendency. After all, the degree of abnormality is increasing where the situation in the latter half is strange. img010.png

Let's look at the last estimated density on a heat map. The left is the estimated density ratio, the middle is the density of the training data, and the right is the density of the test data. img011.png

</ i> Reference

Full source code for this page (The broken parts in this article, such as the graph drawing code, are also described here.)  https://github.com/matsuken92/Qiita_Contents/blob/master/General/Density_ratio_R.ipynb

rpy2 package reference  http://rpy2.readthedocs.io/

densratio package (@hoxo_m)   https://github.com/hoxo-m/densratio

Anomaly detection by density ratio estimation (@oshokawa)  https://speakerdeck.com/oshokawa/mi-du-bi-tui-ding-niyoruyi-chang-jian-zhi

"Abnormality detection and change detection" by Tsuyoshi Ide and Masashi Sugiyama (Machine Learning Professional Series)  https://www.amazon.co.jp/dp/4061529080

Singularity detection method using density ratio estimation  http://2boy.org/~yuta/publications/ibis2007.pdf

Recommended Posts