[PYTHON] Clustering with scikit-learn (1)

Overview of clustering

For the theory behind the implementation, see the K-means++ and K-means pages. They are in English, but they contain more information and are more rigorous than the Japanese pages.

Overview of scikit-learn

scikit-learn is a very useful library that lets you do all kinds of machine learning in Python. For details, see the following pages.

scikit-learn Introduction of scikit-learn

Premise

The input is assumed to be a text file containing one string per line. To run the script, pass the input and output files as arguments, for example: python clustering.py input.txt output.txt. Intermediate results are shown with print() along the way, and the clustered data is written to the output file. Prepare a suitable text file and try it out.
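
The file handling described above can be sketched roughly as follows. This is only a minimal sketch, not the article's actual code: the helper names read_texts and write_clusters, the output layout (a header line per cluster followed by its members), and the placeholder that puts every line into a single cluster are all assumptions made for illustration.

```python
import sys


def read_texts(path):
    """Read one string per line, skipping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def write_clusters(path, clusters):
    """Write each cluster as a header line followed by its members (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        for i, cluster in enumerate(clusters):
            f.write("# cluster %d\n" % i)
            for text in cluster:
                f.write(text + "\n")
            f.write("\n")


def main(argv):
    """Entry point: argv is expected to look like sys.argv."""
    if len(argv) != 3:
        print("usage: python clustering.py input.txt output.txt")
        return 1
    texts = read_texts(argv[1])
    print("read %d lines" % len(texts))
    # Placeholder: the real script would build the clusters with make_cluster().
    clusters = [texts]
    write_clusters(argv[2], clusters)
    return 0
```

In a script, main would typically be invoked as sys.exit(main(sys.argv)) under the usual if __name__ == "__main__": guard.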

Implementation

The full code is in the following file.

clustering.py

Only the method that creates the clusters is quoted here.

clustering.py



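# Imports needed by this method (they belong at the top of clustering.py)
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def make_cluster(self):
    """Create and return the clusters."""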

    # Read the list of strings to be processed
    texts = self._read_from_file()
    print("texts are {}".format(texts))

    # Generate TF-IDF vectors
    vectorizer = TfidfVectorizer(
        max_df=self.max_df,
        max_features=self.max_features,
        stop_words='english'
        )
    X = vectorizer.fit_transform(texts)
    print("X values are {}".format(X))

    # Create a MiniBatchKMeans instance and run the clustering
    # Tune these parameters to the amount and characteristics of the data.
    km = MiniBatchKMeans(
        n_clusters=self.num_clusters,
        init='k-means++', batch_size=1000,
        n_init=10, max_no_improvement=10,
        verbose=True
        )
    km.fit(X)
    labels = km.labels_

    # For each sample, get the distance to the center of its assigned cluster
    transformed = km.transform(X)
    dists = transformed[np.arange(len(labels)), labels]

    # Within each cluster, order the texts by distance to the center (closest first)
    clusters = []
    for i in range(self.num_clusters):
        ii = np.where(labels == i)[0]
        order = ii[np.argsort(dists[ii], kind="stable")]
        clusters.append([texts[j] for j in order])

    # Return the generated clusters
    return clusters
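
To try the same steps outside the class, they can be sketched as a standalone function. This is a minimal sketch under stated assumptions: the function name cluster_texts, its default parameter values, the fixed random seed, and the sample sentences are all made up for illustration and do not come from the original clustering.py.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_texts(texts, num_clusters=2, max_df=1.0, max_features=1000, seed=0):
    """Group texts into clusters; each cluster lists its closest texts first."""
    # Turn the texts into a sparse TF-IDF matrix
    vectorizer = TfidfVectorizer(
        max_df=max_df, max_features=max_features, stop_words="english"
    )
    X = vectorizer.fit_transform(texts)

    # Cluster the vectors (parameter values are illustrative assumptions)
    km = MiniBatchKMeans(
        n_clusters=num_clusters, init="k-means++", batch_size=1000,
        n_init=10, max_no_improvement=10, random_state=seed,
    )
    km.fit(X)
    labels = km.labels_

    # Distance from each sample to the center of its own cluster
    dists = km.transform(X)[np.arange(len(texts)), labels]

    clusters = []
    for i in range(num_clusters):
        members = np.where(labels == i)[0]
        order = members[np.argsort(dists[members], kind="stable")]
        clusters.append([texts[j] for j in order])
    return clusters


# Made-up sample data, for illustration only
sample_texts = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "cats chase the kitten",
    "stock markets fell sharply",
    "investors sold stock today",
    "markets and investors react",
]
for i, cluster in enumerate(cluster_texts(sample_texts)):
    print(i, cluster)
```

Every input text ends up in exactly one of the returned lists. Which texts are grouped together depends on the data and on the parameters.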

Consideration

As you can see, scikit-learn makes it possible to implement clustering with very little code. The code does depend on a number of parameters, such as max_df, max_features, n_clusters, batch_size, n_init, and max_no_improvement. In a real application, these need to be tuned to the characteristics of the data.
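
One simple way to start tuning is to run the clustering for several candidate values of a parameter and compare a summary number. The sketch below does this for the number of clusters, using the inertia_ attribute of the fitted model (the sum of squared distances to the nearest cluster center). The function name inertia_by_num_clusters, the candidate values, the seed, and the sample sentences are assumptions for illustration. Inertia is only one possible criterion and is not something the original article prescribes.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def inertia_by_num_clusters(texts, candidates, seed=0):
    """Return {n_clusters: inertia} for each candidate number of clusters."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    results = {}
    for k in candidates:
        km = MiniBatchKMeans(
            n_clusters=k, init="k-means++", batch_size=1000,
            n_init=10, max_no_improvement=10, random_state=seed,
        )
        km.fit(X)
        results[k] = float(km.inertia_)
    return results


# Made-up sample data, for illustration only
tuning_texts = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "cats chase the kitten",
    "stock markets fell sharply",
    "investors sold stock today",
    "markets and investors react",
    "rain is expected tomorrow",
    "heavy rain and wind tomorrow",
]
for k, inertia in inertia_by_num_clusters(tuning_texts, [2, 3, 4]).items():
    print("n_clusters=%d inertia=%.3f" % (k, inertia))
```

Lower inertia means tighter clusters, but it tends to fall as the number of clusters grows, so the values should be read as a trend rather than as a score to minimize.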
