[PYTHON] Define your own distance function with scikit-learn's k-means

What this article covers

As you all know, k-means is a method that partitions data into clusters.

It appears in just about every textbook, so I won't explain it in detail here, but the fact that it is still so widely used is (self-censored).

By the way, I think a video like this or a tool like this is a nice way to get a feel for how k-means behaves.

Now, k-means needs a "distance function" to measure the distance between each centroid and each data point.

Textbooks usually pick the Euclidean distance, but is that really the right choice?

This question is answered clearly by the following paper:

Anna Huang, "Similarity Measures for Text Document Clustering", 2008

Let's take a look at Table 4, borrowed from the paper. The scores are an entropy index, so lower means better clustering.

[Table 4 from the paper: clustering entropy for each similarity measure]

You can see that the Euclidean distance is not the winner in the text-clustering task.

Naturally, that makes you want to try other distance functions.

Let's do it with scikit-learn

To get straight to the conclusion: scikit-learn offers no such option.

You might think it would be tedious to check this, but it really isn't.

You can see that by looking at the scikit-learn code.

    closest_dist_sq = euclidean_distances(
        centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms,
        squared=True)
    current_pot = closest_dist_sq.sum()

This is a snippet from the k-means code, and you can see that the Euclidean distance is hardcoded.

In other words, there is no room to change the distance function as an option.

So why isn't it supported? Looking at the issues on GitHub, the stated reason is something like "because it is difficult to generalize k-means to other distance functions."

Still, I wish they would at least provide it as a hidden option...

So how do you use another distance function?!

There seem to be several ways.

  1. Write k-means yourself. A way to do this is described on this stack-overflow page (http://stackoverflow.com/questions/5529625/is-it-possible-to-specify-your-own-distance-function-using-scikit-learn-k-means).
  2. Use the k-centroid method. It can apparently be used with scikit-learn, but I haven't checked.
  3. Use a monkey-patch. This is what I will do this time.

What is a monkey-patch? In short, it means "overwriting a method at runtime so that it does whatever you want."

This time, let's overwrite the Euclidean distance function used inside k-means. This page describes how to do it.
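Before applying this to k-means, here is a minimal, self-contained sketch of the mechanism (math.sqrt is used only as a stand-in target for illustration):

import math

def fake_sqrt(x):
    # A replacement that ignores its input, just to show the mechanism
    return 1.0

# Overwrite the attribute on the module object:
# from now on, every caller of math.sqrt gets fake_sqrt instead
math.sqrt = fake_sqrt
print(math.sqrt(16))  # -> 1.0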

Let's actually do it.

The data to be used is borrowed from here. It isn't text data, but it's cute...

Let's see the behavior of scikit-learn's Euclidean distance

The Euclidean distance function that k-means calls is sklearn.metrics.pairwise.euclidean_distances.

Its behavior can be roughly divided into three cases:

  1. Y is None
  2. X is an array containing one sample and Y is an array of multiple samples
  3. X is a 2-d array of multiple samples and Y is an array of multiple samples

In every case it returns an array of the form

array([
    [distance between X[i] and Y[j]]
])

so a new distance function has to follow the same format.
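As a small sketch of those three cases (made-up data; recent scikit-learn versions expect 2-d inputs, so the single-sample case is passed as a 1-row array):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Case 1: Y is None -> all pairwise distances within X, shape (3, 3)
print(euclidean_distances(X).shape)
# Case 2: one sample against all of X -> shape (1, 3)
print(euclidean_distances(X[0, np.newaxis], X).shape)
# Case 3: several samples against all of X -> shape (2, 3)
print(euclidean_distances(X[:2], X).shape)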

Implement the Pearson correlation coefficient distance function

Let's implement the Pearson correlation coefficient distance introduced in the paper. In Python it is available as scipy.stats.pearsonr.

However, as mentioned in the paper, the correlation coefficient is not a distance function as it is, so I transform the formula a little:

special_pearsonr(X, Y) = 1 - pearsonr(X, Y)    if pearsonr(X, Y) >= 0
                         |pearsonr(X, Y)|      otherwise
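A quick numeric illustration of the transformed measure (the vectors are made up): perfectly correlated vectors get distance 0, perfectly anti-correlated ones get distance 1.

import scipy.stats

# pearsonr returns (correlation, p-value); only the correlation is used here
r_pos, _ = scipy.stats.pearsonr([1, 2, 3], [2, 4, 6])   # r =  1.0 -> distance 1 - 1.0 = 0.0
r_neg, _ = scipy.stats.pearsonr([1, 2, 3], [3, 2, 1])   # r = -1.0 -> distance |-1.0| = 1.0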

Now, let's implement it so that it handles the same three cases as the Euclidean distance function.

import numpy
import scipy.stats
import scipy.sparse
from itertools import combinations


def special_pearsonr(X, Y):
    # scipy.stats.pearsonr returns (correlation, p-value); only the correlation is used
    pearsonr_value = scipy.stats.pearsonr(X, Y)
    if pearsonr_value[0] < 0:
        return abs(pearsonr_value[0])
    else:
        return 1 - pearsonr_value[0]


def sub_pearsonr(X, row_index_1, row_index_2):
    pearsonr_distance = special_pearsonr(X[row_index_1], X[row_index_2])
    return (row_index_1, row_index_2, pearsonr_distance)


def copy_inverse_index(row_col_data_tuple):
    # Swap row and column so the distance matrix becomes symmetric
    # (this helper is also shown in the Kullback-Leibler section below)
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])


def pearsonr_distances(X, Y=None):
    if Y is None:
        # Case 1: only X is given and X is a 2-d array
        row_combinations = list(combinations(range(0, len(X)), 2))
        pearsonr_set = [sub_pearsonr(X, index_set_tuple[0], index_set_tuple[1]) for index_set_tuple in row_combinations]
        matrix_source_data = pearsonr_set + list(map(copy_inverse_index, pearsonr_set))

        row = [t[0] for t in matrix_source_data]
        col = [t[1] for t in matrix_source_data]
        data = [t[2] for t in matrix_source_data]

        pearsonr_matrix = scipy.sparse.csr_matrix((data, (row, col)), (X.shape[0], X.shape[0]))
        return pearsonr_matrix.toarray()

    elif len(X.shape) == 1 and len(Y.shape) == 2:
        # Case 2: X is one sample and Y is multiple samples;
        # return an array with one distance per sample of Y
        pearsonr_set = numpy.array([special_pearsonr(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        return pearsonr_set

    elif len(X.shape) == 2 and len(Y.shape) == 2:
        # Case 3: return the distances between every row X[i] and every row Y[j]
        # pearsonr_x_and_all_y computes the distances between one sample x and every row of Y
        pearsonr_x_and_all_y = lambda x, Y: numpy.array([special_pearsonr(x, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        pearsonr_divergence_set = numpy.array([pearsonr_x_and_all_y(X[x_i], Y) for x_i in range(0, X.shape[0])])
        return pearsonr_divergence_set
    else:
        raise Exception("Exception case caused")

pearsonr_distances is the distance function that k-means will call.
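As a quick sanity check with made-up data, the function returns arrays of distances in each of the three cases:

import numpy

A = numpy.array([[1.0, 2.0, 3.0],
                 [2.0, 3.0, 1.0],
                 [3.0, 1.0, 2.0]])

print(pearsonr_distances(A).shape)         # (3, 3): all pairwise distances within A
print(pearsonr_distances(A[0], A).shape)   # (3,):   one sample against all rows of A
print(pearsonr_distances(A[:2], A).shape)  # (2, 3): two samples against all rows of A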

Now, let's monkey-patch the Euclidean distance function.

import time
from sklearn.cluster import KMeans
from sklearn.cluster import k_means_

start = time.time()

# To keep the interface intact, define a function with the same signature as the Euclidean distance function
def new_euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    return pearsonr_distances(X, Y)

# Overwrite it here!
# (this relies on scikit-learn's internal module layout at the time of writing)
k_means_.euclidean_distances = new_euclidean_distances
kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)
print(kmeans_model.labels_)
elapsed_time = time.time() - start
print("Pearsonr k-means:{0}".format(elapsed_time))

I wanted to see the execution time as well, so I also measure the elapsed time.
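Incidentally, if you want to undo the monkey-patch afterwards, you can simply put the original function back (a small sketch):

from sklearn.metrics.pairwise import euclidean_distances

# Restore the original implementation so later runs use the real Euclidean distance again
k_means_.euclidean_distances = euclidean_distances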

I forgot to mention it, but the data is the following:

features = numpy.array([
        [  80,  85, 100 ],
        [  96, 100, 100 ],
        [  54,  83,  98 ],
        [  80,  98,  98 ],
        [  90,  92,  91 ],
        [  84,  78,  82 ],
        [  79, 100,  96 ],
        [  88,  92,  92 ],
        [  98,  73,  72 ],
        [  75,  84,  85 ],
        [  92, 100,  96 ],
        [  96,  92,  90 ],
        [  99,  76,  91 ],
        [  75,  82,  88 ],
        [  90,  94,  94 ],
        [  54,  84,  87 ],
        [  92,  89,  62 ],
        [  88,  94,  97 ],
        [  42,  99,  80 ],
        [  70,  98,  70 ],
        [  94,  78,  83 ],
        [  52,  73,  87 ],
        [  94,  88,  72 ],
        [  70,  73,  80 ],
        [  95,  84,  90 ],
        [  95,  88,  84 ],
        [  75,  97,  89 ],
        [  49,  81,  86 ],
        [  83,  72,  80 ],
        [  75,  73,  88 ],
        [  79,  82,  76 ],
        [ 100,  77,  89 ],
        [  88,  63,  79 ],
        [ 100,  50,  86 ],
        [  55,  96,  84 ],
        [  92,  74,  77 ],
        [  97,  50,  73 ],
])

When I run it, I get:

[0 0 0 1 1 2 1 0 2 1 0 2 2 0 0 1 1 0 1 0 2 0 2 0 2 2 1 1 2 1 1 2 2 2 1 2 2]
Pearsonr k-means:11.0138161182

I was able to confirm that it works properly. Nice!

By the way, the normal Euclidean-distance k-means gives:

[1 1 0 1 1 2 1 1 2 1 1 1 2 1 1 0 2 1 0 0 2 0 2 0 1 1 1 0 2 2 2 2 2 2 0 2 2]
========================================
Normal k-means:0.016499042511

You can see that some of the data points end up in different clusters (the 0 and 1 labels differ in places).
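For reference, those baseline numbers come from an ordinary, unpatched run along these lines (a sketch; it assumes euclidean_distances has been restored as shown earlier):

start = time.time()
kmeans_model = KMeans(n_clusters=3, random_state=10, init='random').fit(features)
print(kmeans_model.labels_)
print("=" * 40)
elapsed_time = time.time() - start
print("Normal k-means:{0}".format(elapsed_time))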

As for the execution time... unfortunately, the Pearson version is nearly 600 times slower...

Since far more Python-level function calls are being made, this is somewhat unavoidable. With a more careful implementation there is a good chance it could be made quite a bit faster (a rough idea is sketched below; volunteers to speed it up are welcome).
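One direction for such a speed-up (only a sketch, not benchmarked, and the helper names are mine): numpy.corrcoef computes all row-wise correlations in one vectorized call, so the case where both X and Y are 2-d arrays could avoid the Python-level loop over scipy.stats.pearsonr.

import numpy

def special_pearsonr_matrix(R):
    # Apply the same transformation as special_pearsonr, elementwise on a matrix of correlations
    return numpy.where(R < 0, numpy.abs(R), 1 - R)

def fast_pearsonr_distances_2d(X, Y):
    # numpy.corrcoef stacks X and Y and returns correlations between all rows;
    # the upper-right block holds the correlations between rows of X and rows of Y
    R = numpy.corrcoef(X, Y)[:X.shape[0], X.shape[0]:]
    return special_pearsonr_matrix(R)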

I haven't used it for document classification yet, so I can't show the difference there, but since the paper demonstrates it, Pearson's correlation coefficient should give good results (it should...).

Summary

  1. scikit-learn's k-means has the Euclidean distance hardcoded, with no option to change it.
  2. By monkey-patching k_means_.euclidean_distances, you can swap in your own distance function, such as one based on Pearson's correlation coefficient.
  3. It works, but it is far slower than the built-in Euclidean version.

By the way ...

I also implemented the (averaged) Kullback-Leibler divergence as a distance. Unfortunately, it stops with an error.

When measuring the distance between X and Y, negative values inevitably appear (presumably because scikit-learn's k-means centers the data by subtracting the mean internally), the distance becomes inf, and the clustering cannot even finish its initialization.

I'm not sure what to do about those negative values; the paper doesn't mention it either.
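The core of the problem can be reproduced directly with scipy (the vectors below are made up purely for illustration): as soon as one element is negative, the relative entropy comes back as inf.

import scipy.stats

# One negative element in pk is enough to make the result inf,
# and an infinite distance breaks the k-means initialization
print(scipy.stats.entropy(pk=[0.5, -0.1, 0.6], qk=[0.3, 0.3, 0.4]))  # -> inf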

import numpy
import scipy.stats
import scipy.sparse
from itertools import combinations
from sklearn.preprocessing import normalize


def averaged_kullback_leibler(X, Y):
    #norm_X = normalize(X=X, norm='l1')[0]
    #norm_Y = normalize(X=Y, norm='l1')[0]
    norm_X = X
    norm_Y = Y

    pie_1 = norm_X / (norm_X + norm_Y)
    pie_2 = norm_Y / (norm_X + norm_Y)
    M = (pie_1 * norm_X) + (pie_2 * norm_Y)

    KLD_averaged = (pie_1 * scipy.stats.entropy(pk=norm_X, qk=M)) + (pie_2 * scipy.stats.entropy(pk=norm_Y, qk=M))

    return KLD_averaged


def sub_KL_divergence(X, row_index_1, row_index_2):
    kl_distance = averaged_kullback_leibler(X[row_index_1], X[row_index_2])
    return (row_index_1, row_index_2, kl_distance)


def copy_inverse_index(row_col_data_tuple):
    # Swap row and column so the distance matrix becomes symmetric
    return (row_col_data_tuple[1], row_col_data_tuple[0], row_col_data_tuple[2])


def KLD_distances(X, Y=None):
    if Y is None:
        # Case 1: only X is given and X is a 2-d array
        row_combinations = list(combinations(range(0, len(X)), 2))
        kl_divergence_set = [sub_KL_divergence(X, index_set_tuple[0], index_set_tuple[1]) for index_set_tuple in row_combinations]
        matrix_source_data = kl_divergence_set + list(map(copy_inverse_index, kl_divergence_set))

        row = [t[0] for t in matrix_source_data]
        col = [t[1] for t in matrix_source_data]
        data = [t[2] for t in matrix_source_data]

        kl_distance_matrix = scipy.sparse.csr_matrix((data, (row, col)), (X.shape[0], X.shape[0]))
        return kl_distance_matrix.toarray()
    elif len(X.shape) == 1 and len(Y.shape) == 2:
        # Case 2: X is one sample and Y is multiple samples;
        # return an array with one distance per sample of Y
        kl_divergence_set = numpy.array([averaged_kullback_leibler(X, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        return kl_divergence_set
    elif len(X.shape) == 2 and len(Y.shape) == 2:
        # Case 3: return the distances between every row X[i] and every row Y[j]
        # kld_x_and_all_y computes the distances between one sample x and every row of Y
        kld_x_and_all_y = lambda x, Y: numpy.array([averaged_kullback_leibler(x, Y[y_sample_index]) for y_sample_index in range(0, Y.shape[0])])
        kl_divergence_set = numpy.array([kld_x_and_all_y(X[x_i], Y) for x_i in range(0, X.shape[0])])
        return kl_divergence_set
    else:
        raise Exception("Exception case caused")
