[PYTHON] Text mining: Probability density distributions on the hypersphere and text clustering with KMeans

Introduction

In a previous article, I looked at measuring the dispersion of directional data on the hypersphere using the von Mises-Fisher distribution (vMFD). The spherecluster package used there comes with several examples; one of them vectorizes the words in each document with tf-idf and uses KMeans to group the documents around the centroids of those vectors. In this article, I run that spherecluster example and compare document clustering with ordinary KMeans against movMF (a mixture of von Mises-Fisher distributions). See here for the full program; below I will introduce only the main points.

Data introduction

First, let me introduce the data. The 20 newsgroups text, together with four category labels, is loaded from sklearn's external datasets.

from sklearn.datasets import fetch_20newsgroups

###############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

The text data is loaded like this (the original post shows a screenshot of the raw text here).
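As a quick look at what was loaded (my addition, not in the original example), you can print the document count, the category names, and the start of the first document:

# Peek at the loaded data: number of documents, category names,
# and the beginning of the first document.
print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print(dataset.target_names)
print(dataset.data[0][:300])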

These documents are converted to vectors with TfidfVectorizer.

print("Extracting features from the training dataset using a sparse vectorizer")
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = vectorizer.fit_transform(dataset.data)

The result is a sparse matrix with 3387 rows and 43256 columns. If you know tf-idf, no explanation is needed: there are 3387 documents and 43256 unique words.
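As a quick sanity check (my addition, not in the original example), these dimensions can be confirmed from the matrix itself:

# X is a scipy sparse matrix: one row per document, one column per unique term.
print("n_samples: %d, n_features: %d" % X.shape)
# Fraction of nonzero entries -- tf-idf document vectors are very sparse.
print("density: %.4f" % (X.nnz / float(X.shape[0] * X.shape[1])))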

Next, LSA is executed if necessary. When use_LSA = True, the document data is reduced to n_components = 500 dimensions in advance; the default is use_LSA = False.

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

use_LSA = False     # set to True to reduce dimensionality first
n_components = 500

###############################################################################
# LSA for dimensionality reduction (and finding dense vectors)
if use_LSA:
    print("Performing dimensionality reduction using LSA")
    svd = TruncatedSVD(n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)
    X = lsa.fit_transform(X)

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()

This completes the data preparation.

Clustering the document vectors

Next, document clustering is performed in four ways: K-means, Spherical K-Means, soft-movMF, and hard-movMF.

import numpy as np
from sklearn.cluster import KMeans
from spherecluster import SphericalKMeans, VonMisesFisherMixture

true_k = len(dataset.target_names)  # number of clusters = number of categories

# K-Means clustering
km = KMeans(n_clusters=true_k, init='k-means++', n_init=20)
print("Clustering with %s" % km)
km.fit(X)

# Spherical K-Means clustering
skm = SphericalKMeans(n_clusters=true_k, init='k-means++', n_init=20)
print("Clustering with %s" % skm)
skm.fit(X)

# Mixture of von Mises Fisher clustering (soft)
vmf_soft = VonMisesFisherMixture(n_clusters=true_k, posterior_type='soft',
    init='random-class', n_init=20, force_weights=np.ones((true_k,))/true_k)
print("Clustering with %s" % vmf_soft)
vmf_soft.fit(X)

# Mixture of von Mises Fisher clustering (hard)
vmf_hard = VonMisesFisherMixture(n_clusters=true_k, posterior_type='hard',
    init='spherical-k-means', n_init=20, force_weights=np.ones((true_k,))/true_k)
print("Clustering with %s" % vmf_hard)
vmf_hard.fit(X)

Evaluation metrics

Here, the clustering results are evaluated with sklearn's metrics. Specifically, six evaluation scores are computed.

For details of each evaluation metric, refer to this article. (*in preparation)
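The original post does not show the evaluation code, so here is a minimal sketch. I am assuming the six scores are homogeneity, completeness, V-measure, adjusted Rand index, adjusted mutual information, and the silhouette coefficient; the exact set in the original may differ.

from sklearn import metrics

labels = dataset.target  # ground-truth category labels

# Each fitted model exposes hard cluster assignments via labels_.
for name, model in [('k-means', km), ('spherical k-means', skm),
                    ('movMF-soft', vmf_soft), ('movMF-hard', vmf_hard)]:
    pred = model.labels_
    print(name)
    print("  Homogeneity:   %.3f" % metrics.homogeneity_score(labels, pred))
    print("  Completeness:  %.3f" % metrics.completeness_score(labels, pred))
    print("  V-measure:     %.3f" % metrics.v_measure_score(labels, pred))
    print("  Adjusted Rand: %.3f" % metrics.adjusted_rand_score(labels, pred))
    print("  Adjusted MI:   %.3f" % metrics.adjusted_mutual_info_score(labels, pred))
    # Silhouette is computed on the data itself; subsample for speed.
    print("  Silhouette:    %.3f" % metrics.silhouette_score(X, pred,
                                                             sample_size=1000))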

Evaluation of the clustering results

Here are the evaluation results (shown as a screenshot of the score table in the original post).

On every metric, the spherecluster methods outperform plain KMeans. In particular, Spherical KMeans, which extends KMeans to the hypersphere, and movMF-soft and movMF-hard, which use the von Mises-Fisher distribution, produced excellent results.
