Speaking of the k-means method, as you all know, it is the algorithm that partitions data into clusters.
It appears in every textbook, so I won't explain it in detail here, but where it is still used in practice is (prohibited matter).
Oh, and by the way, a video like this or a tool like this is a nice way to get an intuitive feel for how k-means moves.
Now, k-means needs a "distance function" to measure the distance between each centroid and each data point.
Textbooks usually pick "Euclidean distance", but is that really the right choice?
This paper, Anna Huang, "Similarity Measures for Text Document Clustering", 2008, answers that question clearly.
Take a look at Table 4, borrowed from the paper. The numbers are an entropy index, so lower means better clustering.
You can see that Euclidean distance is by no means the winner on text classification tasks.
Naturally, that makes you want to try other distance functions.
To give you the conclusion first: scikit-learn has no such option.
Too much bother to check for yourself? Not really.
You can see that by looking at the scikit-learn code.
closest_dist_sq = euclidean_distances(
    centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
    squared=True)
current_pot = closest_dist_sq.sum()
This is a snippet from the k-means implementation, and you can see that the Euclidean distance is hardcoded.
In other words, there is no option that lets you swap in a different distance function.
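To make that concrete (a minimal sketch of my own on toy data): the public API lets you choose things like the number of clusters and the initialization, but there is no metric parameter, so the distance is always Euclidean.

import numpy as np
from sklearn.cluster import KMeans

# toy data: two obvious groups in 2-D
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# KMeans exposes n_clusters, init, n_init, random_state, ... but no
# distance/metric argument
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)           # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)  # centroids under Euclidean distance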
Why is that? Looking at the issues on GitHub, there is a comment to the effect of "because it is hard to generalize k-means to other distance functions".
Still, I wish they had at least left it in as a hidden option ...
There seem to be several ways around this, and the one I'll use here is a monkey patch.
What is a monkey patch? In short, "overwrite a method at runtime and make it do whatever you want".
This time, let's overwrite the Euclidean distance function that k-means uses. This page describes how to do it.
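For those who have never seen one, here is a tiny self-contained illustration of a monkey patch (everything in it is made up for demonstration and has nothing to do with scikit-learn):

import math

original_sqrt = math.sqrt          # keep a reference so the patch can be undone

def noisy_sqrt(x):
    print("sqrt called with", x)   # extra behavior injected at runtime
    return original_sqrt(x)

math.sqrt = noisy_sqrt             # overwrite: every caller now gets noisy_sqrt
print(math.sqrt(16.0))             # prints the message, then 4.0

math.sqrt = original_sqrt          # restore the original when done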
Let's actually try it.
The data I use is borrowed from here. It isn't text data, but please bear with me ...
The Euclidean distance function that k-means calls is this one, euclidean_distances.
Its behavior can be divided into roughly three cases, depending on the input: only X is given, X is a single sample and Y is a set of samples, or X and Y are both sets of samples.
In every case it returns an array of the form

array([
    [distance between X and Y]
])

so a new distance function has to follow the same format.
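As a quick check of that format (a toy example of my own, not from the original post):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[0.0, 0.0], [3.0, 4.0]])   # two samples
Y = np.array([[0.0, 0.0]])               # one sample

# 2d array vs. 2d array: one row of distances per sample of X
print(euclidean_distances(X, Y))
# [[0.]
#  [5.]]

# only X: the full pairwise distance matrix between the rows of X
print(euclidean_distances(X))
# [[0. 5.]
#  [5. 0.]]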
Let's implement the Pearson correlation coefficient introduced in the paper. In Python it is available in scipy.
However, as the paper notes, the correlation coefficient is not a distance function as it is, so I transform it a little:
special_pearsonr(X, Y) = 1 - pearsonr(X, Y)    if pearsonr(X, Y) > 0
                         |pearsonr(X, Y)|      otherwise
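As a quick sanity check of this transformed definition (toy vectors of my own, not from the paper), correlated vectors should come out "close" and anti-correlated vectors "far":

import numpy as np
import scipy.stats

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # perfectly correlated with a
c = np.array([3.0, 2.0, 1.0])   # perfectly anti-correlated with a

# scipy.stats.pearsonr returns a (correlation, p-value) pair
r_ab, _ = scipy.stats.pearsonr(a, b)
r_ac, _ = scipy.stats.pearsonr(a, c)

print(1 - r_ab)    # ~0.0: strongly correlated vectors become "close"
print(abs(r_ac))   # 1.0: strongly anti-correlated vectors stay "far"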
Now, let's implement it so that it covers the same three input cases as the Euclidean distance function.
import numpy
import scipy.stats
import scipy.sparse
from itertools import combinations

def special_pearsonr(X, Y):
    # scipy.stats.pearsonr returns a (correlation, p-value) tuple
    pearsonr_value = scipy.stats.pearsonr(X, Y)
    if pearsonr_value[0] < 0:
        return abs(pearsonr_value[0])
    else:
        return 1 - pearsonr_value[0]

def sub_pearsonr(X, row_index_1, row_index_2):
    pearsonr_distance = special_pearsonr(X[row_index_1], X[row_index_2])
    return (row_index_1, row_index_2, pearsonr_distance)

def copy_inverse_index(row_col_data_tuple):
    # mirror (i, j, distance) into (j, i, distance) so the matrix becomes symmetric
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])

def pearsonr_distances(X, Y=None):
    if Y is None:
        # Case 1: only X is given and X is a 2d array -> full pairwise distance matrix
        row_combinations = list(combinations(range(0, len(X)), 2))
        pearsonr_set = [sub_pearsonr(X, index_set_tuple[0], index_set_tuple[1]) for index_set_tuple in row_combinations]
        matrix_source_data = pearsonr_set + list(map(copy_inverse_index, pearsonr_set))
        row = [t[0] for t in matrix_source_data]
        col = [t[1] for t in matrix_source_data]
        data = [t[2] for t in matrix_source_data]
        pearsonr_matrix = scipy.sparse.csr_matrix((data, (row, col)), (X.shape[0], X.shape[0]))
        return pearsonr_matrix.toarray()
    elif len(X.shape) == 1 and len(Y.shape) == 2:
        # Case 2: X is a single sample and Y is a set of samples
        # -> return a 1d array with one distance per row of Y
        pearsonr_set = numpy.array([special_pearsonr(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        return pearsonr_set
    elif len(X.shape) == 2 and len(Y.shape) == 2:
        # Case 3: X and Y are both sets of samples
        # -> return the distances between every X[i] and every row of Y
        pearsonr_x_and_all_y = lambda X, Y: numpy.array([special_pearsonr(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        pearsonr_divergence_set = numpy.array([pearsonr_x_and_all_y(X[x_i], Y) for x_i in range(0, X.shape[0])])
        return pearsonr_divergence_set
    else:
        raise Exception("Exception case caused")
pearsonr_distances is the distance function that k-means will end up calling.
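Before patching anything, it is worth checking that pearsonr_distances returns the same kinds of shapes as euclidean_distances. A quick check on made-up arrays (A and B below exist only for this check):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0]])
B = np.array([[2.0, 4.0, 6.0]])

print(pearsonr_distances(A, B))     # shape (2, 1): distance from each row of A to B[0]
print(pearsonr_distances(A[0], B))  # shape (1,): one sample against a set of samples
print(pearsonr_distances(A))        # shape (2, 2): full pairwise distance matrix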
Now, let's monkey-patch the Euclidean distance function.
import time
from sklearn.cluster import KMeans

start = time.time()

# To keep the call signature intact, define a function that takes the same
# parameters as the Euclidean distance function.
def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return pearsonr_distances(X, Y)

from sklearn.cluster import k_means_
# Overwrite here!
k_means_.euclidean_distances = new_euclidean_distances

kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)
print(kmeans_model.labels_)

elapsed_time = time.time() - start
print("Pearsonr k-means:{0}".format(elapsed_time))
I wanted to see how long it takes, so I also measured the execution time.
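One caveat worth noting (my own note, not part of the original experiment): the patch is global for the whole process, so every KMeans.fit that runs afterwards will also use the Pearson distance. If that is not what you want, keep a reference to the original function and put it back afterwards. A sketch, assuming a scikit-learn version in which sklearn.cluster.k_means_ takes euclidean_distances from sklearn.metrics.pairwise:

from sklearn.cluster import k_means_
from sklearn.metrics.pairwise import euclidean_distances as original_euclidean_distances

try:
    k_means_.euclidean_distances = new_euclidean_distances
    kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)
finally:
    # put the real Euclidean distance back so later fits behave normally
    k_means_.euclidean_distances = original_euclidean_distances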
I forgot to mention the data I used:
features = numpy.array([
[ 80, 85, 100 ],
[ 96, 100, 100 ],
[ 54, 83, 98 ],
[ 80, 98, 98 ],
[ 90, 92, 91 ],
[ 84, 78, 82 ],
[ 79, 100, 96 ],
[ 88, 92, 92 ],
[ 98, 73, 72 ],
[ 75, 84, 85 ],
[ 92, 100, 96 ],
[ 96, 92, 90 ],
[ 99, 76, 91 ],
[ 75, 82, 88 ],
[ 90, 94, 94 ],
[ 54, 84, 87 ],
[ 92, 89, 62 ],
[ 88, 94, 97 ],
[ 42, 99, 80 ],
[ 70, 98, 70 ],
[ 94, 78, 83 ],
[ 52, 73, 87 ],
[ 94, 88, 72 ],
[ 70, 73, 80 ],
[ 95, 84, 90 ],
[ 95, 88, 84 ],
[ 75, 97, 89 ],
[ 49, 81, 86 ],
[ 83, 72, 80 ],
[ 75, 73, 88 ],
[ 79, 82, 76 ],
[ 100, 77, 89 ],
[ 88, 63, 79 ],
[ 100, 50, 86 ],
[ 55, 96, 84 ],
[ 92, 74, 77 ],
[ 97, 50, 73 ],
])
When I run it, I get:

[0 0 0 1 1 2 1 0 2 1 0 2 2 0 0 1 1 0 1 0 2 0 2 0 2 2 1 1 2 1 1 2 2 2 1 2 2]
Pearsonr k-means:11.0138161182

So it really does work. Nice!
By the way, the normal k-means with Euclidean distance gives:
[1 1 0 1 1 2 1 1 2 1 1 1 2 1 1 0 2 1 0 0 2 0 2 0 1 1 1 0 2 2 2 2 2 2 0 2 2]
========================================
Normal k-means:0.016499042511
You can see that some of the data points fall into different clusters than before.
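By the way, since cluster IDs are arbitrary, comparing the two label arrays position by position is misleading; if you want a single number for how similar the two partitions are, something like the adjusted Rand index does the job. A small sketch, where pearson_labels and euclidean_labels are hypothetical names for the two labels_ arrays saved from the runs above:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the two partitions are identical; values near 0 mean they are unrelated
print(adjusted_rand_score(pearson_labels, euclidean_labels))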
As for the execution time ... unfortunately, the patched version is nearly 600 times slower ...
With the extra function calls involved, that is somewhat unavoidable. With a more careful implementation there is a good chance it could be made quite a bit faster. (Anyone willing to speed it up, applications are open.)
I haven't checked the difference on actual document classification, but since the paper demonstrates it, Pearson's correlation coefficient should give good results there (it should ...).
I also implemented the Kullback-Leibler distance, but unfortunately it stops with an error.
When measuring the distance between X and Y, negative components inevitably show up, the distance becomes inf, and the initialization step cannot even finish.
I wonder what to do about those negative components, but the paper doesn't say.
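One workaround I might try (purely my own guess, the paper does not discuss it) is to force the inputs to look like probability distributions before any KL-based distance is computed: shift each column so everything is non-negative, add a small epsilon, and L1-normalize each row. The helper below is hypothetical and not part of the implementation that follows:

import numpy as np

def to_probability_rows(X, eps=1e-12):
    # shift each column so its minimum becomes ~0, add a small epsilon to avoid
    # log(0), then L1-normalize every row so it behaves like a probability vector
    shifted = X - X.min(axis=0, keepdims=True) + eps
    return shifted / shifted.sum(axis=1, keepdims=True)

In any case, here is the implementation as it stands: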
from sklearn.preprocessing import normalize

def averaged_kullback_leibler(X, Y):
    #norm_X = normalize(X=X, norm='l1')[0]
    #norm_Y = normalize(X=Y, norm='l1')[0]
    norm_X = X
    norm_Y = Y

    pie_1 = norm_X / (norm_X + norm_Y)
    pie_2 = norm_Y / (norm_X + norm_Y)
    M = (pie_1 * norm_X) + (pie_2 * norm_Y)
    KLD_averaged = (pie_1 * scipy.stats.entropy(pk=norm_X, qk=M)) + (pie_2 * scipy.stats.entropy(pk=norm_Y, qk=M))
    return KLD_averaged

def sub_KL_divergence(X, row_index_1, row_index_2):
    #kl_distance = scipy.stats.entropy(pk=X[row_index_1], qk=X[row_index_2])
    kl_distance = averaged_kullback_leibler(X[row_index_1], X[row_index_2])
    return (row_index_1, row_index_2, kl_distance)

def copy_inverse_index(row_col_data_tuple):
    # mirror (i, j, distance) into (j, i, distance)
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])

def KLD_distances(X, Y=None):
    if Y is None:
        # Case 1: only X is given and X is a 2d array -> full pairwise distance matrix
        row_combinations = list(combinations(range(0, len(X)), 2))
        kl_divergence_set = [sub_KL_divergence(X, index_set_tuple[0], index_set_tuple[1]) for index_set_tuple in row_combinations]
        matrix_source_data = kl_divergence_set + list(map(copy_inverse_index, kl_divergence_set))
        row = [t[0] for t in matrix_source_data]
        col = [t[1] for t in matrix_source_data]
        data = [t[2] for t in matrix_source_data]
        kl_distance_matrix = scipy.sparse.csr_matrix((data, (row, col)), (X.shape[0], X.shape[0]))
        return kl_distance_matrix.toarray()
    elif len(X.shape) == 1 and len(Y.shape) == 2:
        # Case 2: X is a single sample and Y is a set of samples
        # -> return a 1d array with one distance per row of Y
        kl_divergence_set = numpy.array([averaged_kullback_leibler(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        return kl_divergence_set
    elif len(X.shape) == 2 and len(Y.shape) == 2:
        # Case 3: X and Y are both sets of samples
        # -> return the distances between every X[i] and every row of Y
        # (debug prints left in from investigating the inf problem)
        for x_i in range(0, X.shape[0]):
            xx = X[x_i]
            for y_sample_index in range(0, Y.shape[0]):
                #print(xx)
                #print(Y[y_sample_index])
                print(averaged_kullback_leibler(xx, Y[y_sample_index]))
        kld_x_and_all_y = lambda X, Y: numpy.array([averaged_kullback_leibler(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        kl_divergence_set = numpy.array([kld_x_and_all_y(X[x_i], Y) for x_i in range(0, X.shape[0])])
        return kl_divergence_set
    else:
        raise Exception("Exception case caused")