The implementation is based on the following pages: K-means++ and K-means. Although they are in English, they contain more information and are more rigorous than the corresponding Japanese pages.
scikit-learn is an extremely useful library that lets you do all sorts of machine learning in Python. For background, see:
scikit-learn / An introduction to scikit-learn
As input, the script expects a text file with one string per line. Run it with the input and output files as arguments, like python clustering.py input.txt output.txt. Intermediate results are displayed with print(), and the clustered data is written to the output file. Prepare a suitable text file and try it out.
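To try it, you need an input file in the expected format: one string per line. A minimal sketch for generating one (the sample strings here are hypothetical, chosen only to give the clusterer something to separate):

```python
# Create a small sample input file: one string per line.
samples = [
    "machine learning with python",
    "deep learning tutorial",
    "cooking pasta recipes",
    "python data science",
    "italian cooking guide",
]
with open("input.txt", "w") as f:
    f.write("\n".join(samples))
```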
The whole code is below; only the method that creates the clusters is quoted here.
clustering.py
# These imports appear at the top of clustering.py; the rest of the
# class is omitted here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
import numpy as np

def make_cluster(self):
    """Create and return the clusters."""
    # Generate the list of strings to be processed
    texts = self._read_from_file()
    print("texts are %(texts)s" % locals())
    # Generate TF-IDF vectors
    vectorizer = TfidfVectorizer(
        max_df=self.max_df,
        max_features=self.max_features,
        stop_words='english'
    )
    X = vectorizer.fit_transform(texts)
    print("X values are %(X)s" % locals())
    # Create a MiniBatchKMeans instance and cluster.
    # Make sure the parameters are appropriate for the amount and
    # characteristics of your data.
    km = MiniBatchKMeans(
        n_clusters=self.num_clusters,
        init='k-means++', batch_size=1000,
        n_init=10, max_no_improvement=10,
        verbose=True
    )
    km.fit(X)
    labels = km.labels_
    # Compute each sample's distance to its assigned cluster center
    transformed = km.transform(X)
    dists = np.zeros(labels.shape)
    for i in range(len(labels)):
        dists[i] = transformed[i, labels[i]]
    clusters = []
    for i in range(self.num_clusters):
        cluster = []
        ii = np.where(labels == i)[0]
        dd = dists[ii]
        di = np.vstack([dd, ii]).transpose().tolist()
        di.sort()  # sort cluster members by distance, closest first
        for d, j in di:
            cluster.append(texts[int(j)])
        clusters.append(cluster)
    # Return the generated clusters
    return clusters
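The distance-extraction step at the end is the least obvious part: km.transform(X) returns each sample's distance to every cluster center, and we pick out the distance to the sample's own center. It can be illustrated with plain NumPy, using small hypothetical arrays standing in for the transform output:

```python
import numpy as np

# Hypothetical output of km.transform(X): distances from each of 4
# samples to each of 2 cluster centers.
transformed = np.array([
    [0.2, 0.9],
    [0.8, 0.1],
    [0.3, 0.7],
    [0.6, 0.4],
])
labels = np.array([0, 1, 0, 1])  # cluster assigned to each sample

# Distance of each sample to its own cluster center
# (a vectorized equivalent of the loop in make_cluster).
dists = transformed[np.arange(len(labels)), labels]

# For cluster 0: indices of its members, sorted by distance.
ii = np.where(labels == 0)[0]
order = ii[np.argsort(dists[ii])]
```

This is the same result the vstack/sort idiom in the article produces, just expressed with fancy indexing and argsort.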
As you can see, with scikit-learn the clustering itself takes only a few lines of code. A wide variety of parameters are available throughout, and in a real application you will need to tune them to match the characteristics of your data.
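One common way to tune the most important parameter, the number of clusters, is the silhouette score. A sketch, assuming scikit-learn is installed (it uses plain KMeans rather than the article's MiniBatchKMeans, since KMeans never leaves a cluster empty on tiny data; the sample texts are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [
    "python machine learning", "deep learning with python",
    "pasta recipe ideas", "italian pasta cooking",
    "stock market news", "financial market report",
]
X = TfidfVectorizer().fit_transform(texts)

# Try a few cluster counts; higher silhouette score (range -1..1)
# means better-separated clusters.
scores = {}
for k in (2, 3):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(k, scores[k])
```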