[PYTHON] A method for clustering points distributed on a hypersphere, convenient for text mining and feature extraction from directional data.

Introduction

For my research, I needed to use 3D vector data as input for machine learning. The data arrives as 3D vectors of varying length every few seconds.

Focusing only on the direction of each vector, we normalize its length to 1. The main topic is how to extract features from a large amount of such time-series directional data. In short, the parameters of a von Mises-Fisher distribution (vMFD) are estimated by maximum likelihood, and the directional data on the hypersphere is clustered with a mixture of vMFDs (movMF). The fitted parameters serve as a kind of feature of the directional data. As readers who work on natural language processing will recognize, this is closely related to the tf-idf word-frequency index for documents. It can also be used for systems in physical space whose "vector orientation" changes over time.

The reference I used this time is Chapter 7, "Anomaly detection for directional data", of the book "Anomaly Detection and Change Detection". The program used is clara-labs' spherecluster, and I also drew on parts of the detailed explanation in the paper "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions". The goal is to extract features from D-dimensional vector data, but following the spherecluster examples, we will go all the way to clustering after extracting the features. Thanks to this package, the work went much more efficiently. Thank you very much.

**This is my first time working with the von Mises-Fisher distribution and directional data, so I would appreciate any comments if you have further ideas or spot misunderstandings.**

Try using spherecluster

Package overview

spherecluster is a Python package based on the algorithms introduced in the paper above. It mainly provides two algorithms, Spherical K-means (spkmeans) and movMF, and movMF comes in two variants, soft and hard. First, install spherecluster.

pip install spherecluster

It's simple. Note that it depends on packages such as scikit-learn and numpy.

Algorithms

Spherical Clustering

A method that extends the idea of KMeans onto the sphere: the computation is the same as for KMeans, only the input space is restricted to the unit sphere.
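As a quick taste of the API, here is a minimal sketch of spherical k-means on random unit vectors (not from the article; SphericalKMeans is the estimator provided by spherecluster, and the data here is made up for illustration):

import numpy as np
from sklearn.preprocessing import normalize
from spherecluster import SphericalKMeans

# Random 3D points projected onto the unit sphere (illustration only)
X = normalize(np.random.randn(100, 3), norm='l2')

skm = SphericalKMeans(n_clusters=2)
skm.fit(X)

print(skm.cluster_centers_)  # centroids, themselves unit vectors
print(skm.labels_[:10])      # cluster assignment of the first 10 points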

About vMFD

vMFD is a general way of describing points distributed on a hypersphere as a probability distribution, much like the normal distribution in ordinary space. See Wikipedia for details; its density is as follows.

$f_p(x; \mu, \kappa) = C_p(\kappa) \exp(\kappa \mu^\top x), \qquad C_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2} I_{p/2 - 1}(\kappa)}$

where $x$ and $\mu$ are unit vectors in $p$ dimensions and $\kappa \ge 0$.

If you look at the formula, you can see that it is parameterized by two quantities, $\mu$ and $\kappa$, for an input $x$. $\mu$ is the mean direction of the distributed points, and $\kappa$ is the degree of concentration around $\mu$. In other words, as in the figure below, the larger $\kappa$ is, the more tightly the points concentrate around the mean direction $\mu$; as $\kappa$ shrinks, points appear farther away from $\mu$.

(Figure from Wikipedia: the relationship between $\kappa$ and the distribution, where $\theta$ is the angle from the mean direction.)

Also, as you can see from the formula, the normalization constant involves the modified Bessel function of the first kind, which takes some effort to get used to.
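As a small aside, the normalization constant $C_p(\kappa)$ can be evaluated numerically with scipy; a minimal sketch (the helper name vmf_norm_const is my own):

import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vmf_norm_const(kappa, p):
    # C_p(kappa) = kappa^(p/2 - 1) / ((2*pi)^(p/2) * I_{p/2 - 1}(kappa))
    return kappa ** (p / 2.0 - 1.0) / ((2.0 * np.pi) ** (p / 2.0) * iv(p / 2.0 - 1.0, kappa))

# Example: the 3-dimensional case with kappa = 4
print(vmf_norm_const(4.0, 3))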

The movMF used in this article is a mixture distribution of such vMFDs. Here we will focus on actual usage examples rather than on a deep dive into the formulas.

movMF

The likelihood of a mixture distribution whose components are vMFDs is maximized with the EM algorithm.

$f(x \mid \Theta) = \sum_{h=1}^{k} \alpha_h\, f_h(x \mid \mu_h, \kappa_h), \qquad \alpha_h \ge 0, \quad \sum_{h=1}^{k} \alpha_h = 1$

The approach is the same as for the familiar Gaussian mixture model. In other words, each cluster on the sphere is modeled as an independent vMFD, as in the figure below.

(Figure: points on a sphere clustered as independent vMF components.)
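For context, the E-step works exactly as in a Gaussian mixture (a standard mixture-model expression, added here for reference rather than taken from the article): the responsibility of component $h$ for a point $x_i$ is

$\gamma_{ih} = \dfrac{\alpha_h f_h(x_i \mid \mu_h, \kappa_h)}{\sum_{l=1}^{k} \alpha_l f_l(x_i \mid \mu_l, \kappa_l)}$

and the M-step then re-estimates $\alpha_h$, $\mu_h$, and $\kappa_h$ from the points weighted by these responsibilities.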

Example 1: small-mix

Two-dimensional dummy data is clustered with four methods: KMeans, Spherical KMeans, soft-movMF, and hard-movMF. The original program is in the spherecluster examples.

First, import the libraries.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics

import sys
sys.path.append('../spherecluster')

from spherecluster import SphericalKMeans
from spherecluster import VonMisesFisherMixture
from spherecluster import sample_vMF

Then we generate data from the following two vMFs: $\mu_0 = [-0.251, -0.968]$, $\kappa_0 = 8$ and $\mu_1 = [0.399, 0.917]$, $\kappa_1 = 2$. The number of points per class, num_points_per_class, is 100.

# Generate small-mix dataset
mu_0 = np.array([-0.251, -0.968])
mu_1 = np.array([0.399, 0.917])
mus = [mu_0, mu_1]
kappa_0 = 8 # concentration parameter
kappa_1 = 2 # concentration parameter
kappas = [kappa_0, kappa_1]
num_points_per_class = 100

X_0 = sample_vMF(mu_0, kappa_0, num_points_per_class)
X_1 = sample_vMF(mu_1, kappa_1, num_points_per_class)
X = np.zeros((2 * num_points_per_class, 2))
X[:num_points_per_class, :] = X_0
X[num_points_per_class:, :] = X_1
labels = np.zeros((2 * num_points_per_class, ))
labels[num_points_per_class:] = 1

Then we fit KMeans, Spherical KMeans, soft-movMF, and hard-movMF, as sketched below.
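A minimal sketch of this fitting step (the hyperparameter values are assumptions, chosen to match the style used later in this article; the estimators follow the scikit-learn fit API):

# Ordinary (Euclidean) KMeans from scikit-learn
km = KMeans(n_clusters=2, init='k-means++', n_init=10)
km.fit(X)

# Spherical KMeans
skm = SphericalKMeans(n_clusters=2)
skm.fit(X)

# movMF with soft assignments
vmf_soft = VonMisesFisherMixture(n_clusters=2, posterior_type='soft', n_init=20)
vmf_soft.fit(X)

# movMF with hard assignments
vmf_hard = VonMisesFisherMixture(n_clusters=2, posterior_type='hard', n_init=20)
vmf_hard.fit(X)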

At this time, the following error may occur.

ValueError: Data l2-norm must be 1

Inside the movMF code, the L2 norm of each input sample is checked, and for data generated by sample_vMF the norm can deviate from 1 by more than 1e-4 for roughly 5% of the samples. The fit then stops with the error above, so here we compute the norms ourselves and remove the samples that cause the problem.

If you run the following snippet before fitting, the error goes away. (Note that the offending samples are removed, so the total number of samples decreases.)

import scipy.sparse as sp

# Collect the indices of samples whose L2 norm deviates from 1 by more than 1e-4
dummy = []
n_samples = X.shape[0]
for ee in range(n_samples):
    if sp.issparse(X):
        n = sp.linalg.norm(X[ee, :])
    else:
        n = np.linalg.norm(X[ee, :])

    if np.abs(n - 1.) > 1e-4:
        dummy.append([ee, n])

# Drop the offending samples (and keep the label array in sync)
if dummy:
    bad_idx = np.array(dummy)[:, 0].astype(int)
    X = np.delete(X, bad_idx, 0)
    labels = np.delete(labels, bad_idx, 0)
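Alternatively, instead of deleting the offending samples, one could simply rescale every row to unit length before fitting; a minimal sketch using scikit-learn:

from sklearn.preprocessing import normalize

# Rescale every sample to unit L2 norm instead of dropping it
X = normalize(X, norm='l2')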

The execution result is as follows.

(Figure: clustering results of the four methods on the 2D small-mix data.)

Since the original data is neatly divided into two groups, there is no noticeable difference between the methods.

Example 2: small-mix_3d

Let's cluster 3D data in the same way as Example 1. The original code is likewise in the spherecluster examples.

If you look at the script, you can see that it is almost the same as Example 1. Here is the result of running it.

(Figure: clustering results of the four methods on the 3D small-mix data.)

By the way, the estimation results look like this.

(Figure: estimated parameters and evaluation metrics for each method.)

At least for the data generated here, movMF seems to perform better than kmeans and spkmeans.
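One way such a comparison could be computed is with the true labels of the dummy data (a sketch, assuming the estimators km, skm, vmf_soft, and vmf_hard fitted as above, and that each exposes a scikit-learn-style labels_ attribute):

# Compare each method's assignments against the true labels
for name, est in [('kmeans', km), ('spkmeans', skm),
                  ('soft-movMF', vmf_soft), ('hard-movMF', vmf_hard)]:
    ari = metrics.adjusted_rand_score(labels, est.labels_)
    print("{}: adjusted Rand index = {:.3f}".format(name, ari))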

Data introduction

Introducing the data handled this time.

(Figure: a sample of the raw data, with columns id, x, y, z, azimuth, angle, norm.)

id is a value assigned to each node and is constant here; [x, y, z] is the three-dimensional vector; azimuth and angle are set aside for now; and norm is the length of the combined [x, y, z] vector. Assume there are about 3000 samples of this data.
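A minimal sketch of how such data could be loaded, assuming it sits in a hypothetical CSV file data.csv with exactly those columns:

import numpy as np
import pandas as pd

# File name and column names are assumptions for illustration
df = pd.read_csv('data.csv')

# Recompute the length of each [x, y, z] vector as a sanity check against the norm column
df['norm_check'] = np.sqrt(df['x'] ** 2 + df['y'] ** 2 + df['z'] ** 2)
print(df[['x', 'y', 'z', 'norm', 'norm_check']].head())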

EDA

For now, let's just do some simple visualization.

The histograms and pairwise correlation plots of the variables are shown below. Each parameter seems to have some character of its own, but it is not obvious how to extract those characteristics.

(Figure: histograms and pairwise correlation plots of the variables.)
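A plot like this can be produced with seaborn, for example (a sketch, assuming the DataFrame df from the loading sketch above):

import seaborn as sns
from matplotlib import pyplot as plt

# Histograms on the diagonal, pairwise scatter plots off the diagonal
sns.pairplot(df[['x', 'y', 'z', 'norm']])
plt.show()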

Here is the result of visualizing the data in 3D. It is even harder to make sense of.

(Figure: 3D scatter plot of the raw vectors.)

Referring to the spherecluster example that draws a sphere plot, let's extract only the directional component from the scatter plot above and plot it on a 3D sphere.

(Figures: spherical plot of all data; spherical plots of the first and second halves.)

Now some structure appears. To check for the influence of time, I split the data by timestamp into the first half (blue) and the second half (red), but there does not seem to be any particular change between the two halves.
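Here is a minimal sketch of how the direction data can be extracted and drawn on a unit sphere (again assuming the hypothetical DataFrame df and its columns from above):

import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the 3D projection
from sklearn.preprocessing import normalize

# Keep only the direction: rescale each [x, y, z] sample to unit length
X = normalize(df[['x', 'y', 'z']].values, norm='l2')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Wireframe of the unit sphere as a backdrop
u, v = np.mgrid[0:2 * np.pi:20j, 0:np.pi:10j]
ax.plot_wireframe(np.cos(u) * np.sin(v), np.sin(u) * np.sin(v), np.cos(v),
                  color='gray', alpha=0.3)

# The directional data on the sphere
ax.scatter(X[:, 0], X[:, 1], X[:, 2], s=5)
plt.show()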

Fit to movMF

Finally, movMF is fitted to the test data.

###############################################################################
# Mixture of von Mises Fisher clustering (soft)
vmf_soft = VonMisesFisherMixture(n_clusters=1, posterior_type='soft', n_init=20)
vmf_soft.fit(X)

print "center:{}".format(vmf_soft.cluster_centers_)
print "kappa:{}".format(vmf_soft.concentrations_[0])

The result is as follows.

center:[[-0.82651971  0.17847126  0.53386626]]
kappa:3.91851479297

We were able to model the mean direction of the 3D vectors and its degree of concentration with a probability density distribution. For example, if the dimensionality of X is set to the number of words appearing in a document collection, movMF can model the distribution of points on the D-dimensional hypersphere and cluster the documents.
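As a rough illustration of that document-clustering idea, here is a sketch with a made-up toy corpus; tf-idf vectors are L2-normalized by default, so each document is already a point on the unit hypersphere:

from sklearn.feature_extraction.text import TfidfVectorizer
from spherecluster import VonMisesFisherMixture

# Toy corpus, made up for illustration
docs = [
    "the cat sat on the mat",
    "dogs and cats make friendly pets",
    "my dog chased the neighbour's cat",
    "stocks fell as markets closed lower",
    "the market rallied after strong earnings",
    "investors sold shares amid rate fears",
]

# TfidfVectorizer L2-normalizes each document vector by default
X = TfidfVectorizer().fit_transform(docs)

vmf = VonMisesFisherMixture(n_clusters=2, posterior_type='soft', n_init=20)
vmf.fit(X.toarray())
print(vmf.labels_)

In practice the corpus would of course be far larger; the point is only that document vectors and the directional sensor data above live on the same kind of hypersphere.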
