[PYTHON] Reuse the results of clustering

Data mining is the application of analysis techniques to large amounts of data to discover previously unknown features of the data and gain new insights, often using techniques common to machine learning.

A simple idea follows from this: once you have the result of unsupervised learning, you can use it as training data, apply what was learned to the rest of the data, and improve the accuracy of the results.

Classify student grades for the entire grade

Last time, I introduced an example of grouping students by tendency based on their grades. The teacher in charge of these students thought that this grouping could be used to classify every student in the grade.

The support vector machine, which is a kind of discriminant function, and naive Bayes, which is a probabilistic classifier, are based on completely different ideas and methods.

Use scikit-learn

There are many great things about scikit-learn, and one of them is its consistent API: even methods designed in different ways can be implemented with similar code.
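
To illustrate the consistent API, here is a minimal sketch that trains both classifiers through the same `fit` and `predict` calls. The synthetic scores (three groups of 20 students with 5 subjects each) are an assumption made for this illustration and are not the data used in this article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical stand-in for the grade data: 3 groups of 20 students, 5 subjects each
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=center, scale=5.0, size=(20, 5))
    for center in (40.0, 60.0, 80.0)
])

# Labels come from clustering, not from a human annotator
labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(features)

# Two classifiers built on different ideas, driven by identical code
classifiers = {
    "naive Bayes": GaussianNB(),
    "SVM (RBF)": SVC(kernel="rbf", C=1),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(features, labels)                  # same call for both
    predictions[name] = clf.predict(features)  # same call for both
    agreement = np.mean(predictions[name] == labels)
    print(f"{name}: agreement with cluster labels = {agreement:.2f}")
```

Because both objects expose the same methods, swapping one classifier for the other only changes the line that constructs it.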

import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn import svm

# The data itself is too long to list here; it is read from a comma-separated CSV file
features = np.genfromtxt('data.csv', delimiter=',')

# Cluster the data with K-means to obtain a label for each student
kmeans_model = KMeans(n_clusters=3, random_state=10).fit(features)
labels = kmeans_model.labels_

clf = GaussianNB()  # Gaussian naive Bayes classifier (models each feature with a normal distribution)
# clf = svm.SVC(kernel='rbf', C=1)  # Support vector machine with an RBF kernel

# Train the classifier on the clustering results
clf.fit(features, labels)

# Read the data to classify (grades of the other students in the grade)
test_X = np.genfromtxt('test.csv', delimiter=',')

# Predict a group for each student with the trained classifier
results = clf.predict(test_X)

# Sort by total score (highest first) and display the results
ranks = []
for result, feature in zip(results, test_X):
    ranks.append([result, feature, feature.sum()])

ranks.sort(key=lambda x: -x[2])

for rank in ranks:
    print(rank)
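
The script above reads `data.csv` and `test.csv`, which are not listed in this article. As a minimal sketch, the following generates stand-in files so the script can be tried out. The helper `make_grades`, the five subjects, the class sizes, and the score distribution are all assumptions made for illustration, not the original data.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_grades(n_students, n_subjects=5):
    """Return hypothetical integer scores in the range 0-100."""
    # Each student gets a personal ability level, plus per-subject noise
    ability = rng.normal(loc=60.0, scale=15.0, size=(n_students, 1))
    noise = rng.normal(loc=0.0, scale=8.0, size=(n_students, n_subjects))
    return np.clip(np.rint(ability + noise), 0, 100).astype(int)

# One class for clustering, and the rest of the grade for classification
np.savetxt("data.csv", make_grades(30), delimiter=",", fmt="%d")
np.savetxt("test.csv", make_grades(120), delimiter=",", fmt="%d")

# Read the files back the same way the script does
features = np.genfromtxt("data.csv", delimiter=",")
test_X = np.genfromtxt("test.csv", delimiter=",")
print(features.shape, test_X.shape)
```

Each row is one student and each column is one subject, which matches how the script sums a row to get a total score.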

Summary

You can see this for yourself by preparing sample data: visualizing the output shows how the boundary of each class changes from one method to the other. From here on, we will look at the theory behind each method.
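
As a rough way to check how the class boundaries differ, the sketch below trains both classifiers on the same cluster labels and compares their predictions over a grid of scores. It assumes only two subjects so that the grid stays two-dimensional, and the synthetic scores are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical two-subject scores for 90 students
rng = np.random.default_rng(1)
features = np.vstack([
    rng.normal(loc=center, scale=8.0, size=(30, 2))
    for center in ((40.0, 45.0), (60.0, 70.0), (80.0, 60.0))
])
labels = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(features)

nb = GaussianNB().fit(features, labels)
svc = SVC(kernel="rbf", C=1).fit(features, labels)

# Evaluate both classifiers on a regular grid covering the score range 0-100
xs, ys = np.meshgrid(np.linspace(0, 100, 101), np.linspace(0, 100, 101))
grid = np.column_stack([xs.ravel(), ys.ravel()])
nb_map = nb.predict(grid).reshape(xs.shape)
svc_map = svc.predict(grid).reshape(xs.shape)

# Fraction of grid points that the two methods assign to different classes
disagreement = np.mean(nb_map != svc_map)
print(f"grid points where the two classifiers disagree: {disagreement:.1%}")
```

The arrays `nb_map` and `svc_map` hold one predicted class per grid point, so drawing each of them as a filled contour plot (for example with matplotlib's `contourf`) would show the regions side by side.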
