[PYTHON] Text mining: Probability density distributions on the hypersphere and text clustering with KMeans

Introduction

In a previous article, I looked at measuring the dispersion of directional data on the hypersphere using the von Mises-Fisher distribution (vMFD). The spherecluster package used there comes with several examples; one of them vectorizes the words in each document with tf-idf and uses KMeans to group the documents around the centroids of those vectors. In this article, I run that spherecluster example and compare document clustering with ordinary KMeans against movMF (a mixture of von Mises-Fisher distributions). See here for the full program; below I will introduce only the main points.

Data introduction

First, let me introduce the data. The 20 newsgroups text, together with four category labels, is loaded from sklearn's external datasets.

from sklearn.datasets import fetch_20newsgroups

###############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

The text data is loaded like this (the original post shows a screenshot of the raw text here).
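As a quick look at what was loaded (my addition, not in the original example), you can print the document count, the category names, and the start of the first document:

# Peek at the loaded data: number of documents, category names,
# and the beginning of the first document.
print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print(dataset.target_names)
print(dataset.data[0][:300])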

These documents are converted to vectors with TfidfVectorizer.

print("Extracting features from the training dataset using a sparse vectorizer")
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = vectorizer.fit_transform(dataset.data)

The result is a sparse matrix with 3387 rows and 43256 columns. If you know tf-idf, no explanation is needed: there are 3387 documents and 43256 unique words.
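As a quick sanity check (my addition, not in the original example), these dimensions can be confirmed from the matrix itself:

# X is a scipy sparse matrix: one row per document, one column per unique term.
print("n_samples: %d, n_features: %d" % X.shape)
# Fraction of nonzero entries -- tf-idf document vectors are very sparse.
print("density: %.4f" % (X.nnz / float(X.shape[0] * X.shape[1])))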

Next, LSA is executed if necessary. When use_LSA = True, the document data is reduced to n_components = 500 dimensions in advance; the default is use_LSA = False.

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

use_LSA = False     # set to True to reduce dimensionality first
n_components = 500

###############################################################################
# LSA for dimensionality reduction (and finding dense vectors)
if use_LSA:
    print("Performing dimensionality reduction using LSA")
    svd = TruncatedSVD(n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)
    X = lsa.fit_transform(X)

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()

This completes the data preparation.

Clustering the document vectors

Next, document clustering is performed in four ways: K-means, Spherical K-Means, soft-movMF, and hard-movMF.

import numpy as np
from sklearn.cluster import KMeans
from spherecluster import SphericalKMeans, VonMisesFisherMixture

true_k = len(dataset.target_names)  # number of clusters = number of categories

# K-Means clustering
km = KMeans(n_clusters=true_k, init='k-means++', n_init=20)
print("Clustering with %s" % km)
km.fit(X)

# Spherical K-Means clustering
skm = SphericalKMeans(n_clusters=true_k, init='k-means++', n_init=20)
print("Clustering with %s" % skm)
skm.fit(X)

# Mixture of von Mises Fisher clustering (soft)
vmf_soft = VonMisesFisherMixture(n_clusters=true_k, posterior_type='soft',
    init='random-class', n_init=20, force_weights=np.ones((true_k,))/true_k)
print("Clustering with %s" % vmf_soft)
vmf_soft.fit(X)

# Mixture of von Mises Fisher clustering (hard)
vmf_hard = VonMisesFisherMixture(n_clusters=true_k, posterior_type='hard',
    init='spherical-k-means', n_init=20, force_weights=np.ones((true_k,))/true_k)
print("Clustering with %s" % vmf_hard)
vmf_hard.fit(X)

Evaluation metrics

Here, the clustering results are evaluated with sklearn's metrics. Specifically, six evaluation scores are computed.

For details of each evaluation metric, refer to this article. (*in preparation)
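The original post does not show the evaluation code, so here is a minimal sketch. I am assuming the six scores are homogeneity, completeness, V-measure, adjusted Rand index, adjusted mutual information, and the silhouette coefficient; the exact set in the original may differ.

from sklearn import metrics

labels = dataset.target  # ground-truth category labels

# Each fitted model exposes hard cluster assignments via labels_.
for name, model in [('k-means', km), ('spherical k-means', skm),
                    ('movMF-soft', vmf_soft), ('movMF-hard', vmf_hard)]:
    pred = model.labels_
    print(name)
    print("  Homogeneity:   %.3f" % metrics.homogeneity_score(labels, pred))
    print("  Completeness:  %.3f" % metrics.completeness_score(labels, pred))
    print("  V-measure:     %.3f" % metrics.v_measure_score(labels, pred))
    print("  Adjusted Rand: %.3f" % metrics.adjusted_rand_score(labels, pred))
    print("  Adjusted MI:   %.3f" % metrics.adjusted_mutual_info_score(labels, pred))
    # Silhouette is computed on the data itself; subsample for speed.
    print("  Silhouette:    %.3f" % metrics.silhouette_score(X, pred,
                                                             sample_size=1000))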

Evaluation of the clustering results

Here are the evaluation results (shown as a screenshot of the score table in the original post).

On every metric, the spherecluster methods outperform plain KMeans. In particular, Spherical KMeans, which extends KMeans to the hypersphere, and movMF-soft and movMF-hard, which use the von Mises-Fisher distribution, produced excellent results.
