[PYTHON] Reuse the results of clustering

Data mining is the application of analysis techniques to large amounts of data to discover previously unknown features of the data and gain new insights, often using techniques common to machine learning.

As a simple idea, once you have the result of unsupervised learning, you can use that result as training data. This lets you apply what was learned to the rest of the data, and gives you a basis for refining the result further.

Classify the grades of every student in the year

Last time I introduced an example of grouping students by tendency based on their grades. The teacher in charge of these students thought that the students of the entire year could be classified based on the results of that grouping.

Two classifiers that can be used for this are the support vector machine, which is a kind of discriminant function, and naive Bayes, which is a probabilistic classifier. The ideas and methods behind them are completely different.
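To make that difference concrete, here is a minimal sketch of what each classifier exposes after training. It assumes synthetic data from `make_blobs` as a stand-in for the grade data (not the data used in this article), and the variable names are only for illustration. The SVM returns scores from its discriminant function, while naive Bayes returns a probability for each class.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for clustered grade data (an assumption, not the article's data)
X, y = make_blobs(n_samples=150, centers=3, n_features=5, random_state=10)

svc = SVC(kernel='rbf', C=1, decision_function_shape='ovr').fit(X, y)
nb = GaussianNB().fit(X, y)

sample = X[:5]

# SVM: scores from the discriminant function, one column per class
svm_scores = svc.decision_function(sample)

# Naive Bayes: posterior probabilities, each row sums to 1
nb_probs = nb.predict_proba(sample)

print(svm_scores.shape, nb_probs.shape)
print(nb_probs.sum(axis=1))
```

Both models still answer `predict` in the same way, which is what the rest of this article relies on.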

Use scikit-learn

There are many great things about scikit-learn, and one of them is its consistent API. Even classifiers designed around different ideas can be implemented with similar code.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn import svm

# Listing the data here would be redundant, so it is read from a comma-separated CSV file
features = np.genfromtxt('data.csv', delimiter=',')

# Cluster with K-means and use the cluster assignments as labels
kmeans_model = KMeans(n_clusters=3, random_state=10).fit(features)
labels = kmeans_model.labels_

clf = GaussianNB()  # Naive Bayes classifier assuming Gaussian-distributed features
# clf = svm.SVC(kernel='rbf', C=1)  # Support vector machine with an RBF kernel

#Train the classifier based on the clustering results
clf.fit(features, labels)

# Read the data to classify (grades of the other students in the year)
test_X = np.genfromtxt('test.csv', delimiter=',')

#Classify with a classifier
results = clf.predict(test_X)

#Sort and display results
ranks = []
for result, feature in zip(results, test_X):
    ranks.append([result, feature, feature.sum()])

ranks.sort(key=lambda x:(-x[2]))

for rank in ranks:
    print(rank)
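Since `data.csv` and `test.csv` are not included here, the following is a rough self-contained variant for checking the idea. It assumes synthetic data from `make_blobs` in place of both files, and it adds one check that is not in the code above: comparing the classifier's predictions with `KMeans.predict`, which assigns each new point to the nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-ins for data.csv and test.csv (assumed, not the article's data)
all_X, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=10)
features, test_X = all_X[:50], all_X[50:]

kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=10).fit(features)
labels = kmeans_model.labels_

clf = GaussianNB().fit(features, labels)

# How well the classifier reproduces the cluster labels it was trained on
train_agreement = clf.score(features, labels)

# Nearest-centroid assignment by K-means vs. the classifier's prediction
centroid_labels = kmeans_model.predict(test_X)
clf_labels = clf.predict(test_X)
test_agreement = np.mean(centroid_labels == clf_labels)

print(f"agreement on training data: {train_agreement:.2f}")
print(f"agreement with KMeans.predict on new data: {test_agreement:.2f}")
```

Where the two disagree, the classifier has drawn its boundary differently from the nearest-centroid rule, which is one way to see what the classifier adds on top of the clustering.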

Summary

If you prepare sample data and actually run this, I think you can visually understand how the boundaries of each class change with each method. From here on, I will talk about the theory behind each of them.
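As one possible way to look at those boundaries, the sketch below classifies every point on a grid with both classifiers and measures how much of the plane they label differently. It assumes two-dimensional synthetic data from `make_blobs` (two scores per student) so that the plane can be gridded. The grid resolution and the `regions` dictionary are illustrative choices, not part of the original code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Two synthetic scores per student so the plane can be gridded (assumed data)
features, _ = make_blobs(n_samples=90, centers=3, n_features=2, random_state=10)
labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit(features).labels_

# Grid of points covering the range of the data
x_min, y_min = features.min(axis=0) - 1
x_max, y_max = features.max(axis=0) + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
grid = np.c_[xx.ravel(), yy.ravel()]

# Predict the class of every grid point with each classifier
regions = {
    'GaussianNB': GaussianNB().fit(features, labels).predict(grid),
    'SVC (RBF)': SVC(kernel='rbf', C=1).fit(features, labels).predict(grid),
}

# Fraction of the plane where the two methods assign different classes
disagreement = np.mean(regions['GaussianNB'] != regions['SVC (RBF)'])
print(f"grid points classified differently: {disagreement:.1%}")
```

Reshaping either array with `regions[name].reshape(xx.shape)` gives a 2D map that could be passed to a plotting function such as matplotlib's `contourf` to draw the regions.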
