The implementation is based on the following pages: K-means++ and K-means. Although they are in English, they contain more information and are more rigorous than the corresponding Japanese pages.
scikit-learn is an extremely useful library that lets you do all sorts of machine learning in Python. For background, see:
scikit-learn / An introduction to scikit-learn
As input, the script expects a text file with one string per line. Run it with the input and output files as arguments, like python clustering.py input.txt output.txt. Intermediate results are displayed with print(), and the clustered data is written to the output file. Prepare a suitable text file and try it out.
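To try it, you need an input file in the expected format: one string per line. A minimal sketch for generating one (the sample strings here are hypothetical, chosen only to give the clusterer something to separate):

```python
# Create a small sample input file: one string per line.
samples = [
    "machine learning with python",
    "deep learning tutorial",
    "cooking pasta recipes",
    "python data science",
    "italian cooking guide",
]
with open("input.txt", "w") as f:
    f.write("\n".join(samples))
```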
The whole code is below; only the method that creates the clusters is quoted here.
clustering.py
# These imports appear at the top of clustering.py; the rest of the
# class is omitted here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
import numpy as np

def make_cluster(self):
    """Create and return the clusters."""
    # Generate the list of strings to be processed
    texts = self._read_from_file()
    print("texts are %(texts)s" % locals())
    # Generate TF-IDF vectors
    vectorizer = TfidfVectorizer(
        max_df=self.max_df,
        max_features=self.max_features,
        stop_words='english'
    )
    X = vectorizer.fit_transform(texts)
    print("X values are %(X)s" % locals())
    # Create a MiniBatchKMeans instance and cluster.
    # Make sure the parameters are appropriate for the amount and
    # characteristics of your data.
    km = MiniBatchKMeans(
        n_clusters=self.num_clusters,
        init='k-means++', batch_size=1000,
        n_init=10, max_no_improvement=10,
        verbose=True
    )
    km.fit(X)
    labels = km.labels_
    # Compute each sample's distance to its assigned cluster center
    transformed = km.transform(X)
    dists = np.zeros(labels.shape)
    for i in range(len(labels)):
        dists[i] = transformed[i, labels[i]]
    clusters = []
    for i in range(self.num_clusters):
        cluster = []
        ii = np.where(labels == i)[0]
        dd = dists[ii]
        di = np.vstack([dd, ii]).transpose().tolist()
        di.sort()  # sort cluster members by distance, closest first
        for d, j in di:
            cluster.append(texts[int(j)])
        clusters.append(cluster)
    # Return the generated clusters
    return clusters
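The distance-extraction step at the end is the least obvious part: km.transform(X) returns each sample's distance to every cluster center, and we pick out the distance to the sample's own center. It can be illustrated with plain NumPy, using small hypothetical arrays standing in for the transform output:

```python
import numpy as np

# Hypothetical output of km.transform(X): distances from each of 4
# samples to each of 2 cluster centers.
transformed = np.array([
    [0.2, 0.9],
    [0.8, 0.1],
    [0.3, 0.7],
    [0.6, 0.4],
])
labels = np.array([0, 1, 0, 1])  # cluster assigned to each sample

# Distance of each sample to its own cluster center
# (a vectorized equivalent of the loop in make_cluster).
dists = transformed[np.arange(len(labels)), labels]

# For cluster 0: indices of its members, sorted by distance.
ii = np.where(labels == 0)[0]
order = ii[np.argsort(dists[ii])]
```

This is the same result the vstack/sort idiom in the article produces, just expressed with fancy indexing and argsort.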
As you can see, with scikit-learn the clustering itself takes only a few lines of code. A wide variety of parameters are available throughout, and in a real application you will need to tune them to match the characteristics of your data.
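One common way to tune the most important parameter, the number of clusters, is the silhouette score. A sketch, assuming scikit-learn is installed (it uses plain KMeans rather than the article's MiniBatchKMeans, since KMeans never leaves a cluster empty on tiny data; the sample texts are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [
    "python machine learning", "deep learning with python",
    "pasta recipe ideas", "italian pasta cooking",
    "stock market news", "financial market report",
]
X = TfidfVectorizer().fit_transform(texts)

# Try a few cluster counts; higher silhouette score (range -1..1)
# means better-separated clusters.
scores = {}
for k in (2, 3):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(k, scores[k])
```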