Things to watch out for when using Deeplearning4j Kmeans

Conclusion

Do not use `cosinesimilarity` for `distanceFunction` when using Kmeans in deeplearning4j.

Reason

deeplearning4j has a Kmeans function. It is used in the following form.

    KMeansClustering kmc = KMeansClustering.setup(num, iter, distanceFunction);
    ClusterSet cs = kmc.applyTo(pointsLst);

Here `num` is the number of clusters, `iter` is the number of iterations (`10` is often used), and `distanceFunction` is the distance function (you can specify `euclidean`, `manhattan`, or `cosinesimilarity`).

This is where the land mine is. If you specify `cosinesimilarity` for `distanceFunction`, you will not get the results you expect.

As you may have noticed, only `cosinesimilarity` is a similarity. The other two are distances. That is, **with `cosinesimilarity`, a larger value means the points are more alike (the maximum is 1), while with the other two, a smaller value means the points are more alike**.
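To make the difference concrete, here is a minimal sketch in plain Java. It does not use the deeplearning4j API; the class name `SimilarityVsDistance` and both helper methods are hypothetical names written only for this illustration. It compares one point against a nearby vector and a far-away vector with both measures.

```java
public class SimilarityVsDistance {

    // Cosine similarity: larger means more alike, 1.0 means same direction.
    public static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance: smaller means more alike, 0.0 means identical.
    public static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] point = {1.0, 0.0};
        double[] near = {0.9, 0.1};  // almost the same direction as point
        double[] far = {0.0, 1.0};   // orthogonal to point

        System.out.println("similarity near: " + cosineSimilarity(point, near));
        System.out.println("similarity far:  " + cosineSimilarity(point, far));
        System.out.println("distance near:   " + euclideanDistance(point, near));
        System.out.println("distance far:    " + euclideanDistance(point, far));
    }
}
```

For the nearby vector the similarity is the larger of the two values, while the distance is the smaller of the two. The two measures rank the same pair in opposite numeric order.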

The relevant deeplearning4j code looks like this, in `ClusterSet.class`:

    public Pair<Cluster, Double> nearestCluster(Point point) {

        Cluster nearestCluster = null;
        double minDistance = Float.MAX_VALUE;

        double currentDistance;
        for (Cluster cluster : getClusters()) {
            currentDistance = cluster.getDistanceToCenter(point);
            if (currentDistance < minDistance) {
                minDistance = currentDistance;
                nearestCluster = cluster;
            }
        }

        return new Pair<>(nearestCluster, minDistance);

    }

This function calculates the distance between the point and the center of each existing cluster, and assigns the point to the cluster with the smallest distance. That is perfectly reasonable for a distance, but it goes wrong for `cosinesimilarity`, because there the meaning of the value is reversed: the smallest value belongs to the least similar cluster.
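The effect can be reproduced without deeplearning4j. The sketch below is an assumption-laden imitation, not the library's code: `MinSelectionDemo`, `indexOfSmallest`, and the two measure functions are hypothetical names, and the loop only mirrors the "keep the smallest value" logic shown above.

```java
import java.util.function.ToDoubleBiFunction;

public class MinSelectionDemo {

    public static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Hypothetical stand-in for the selection loop: returns the index of the
    // center whose measure value is smallest, whatever the measure means.
    public static int indexOfSmallest(double[] point, double[][] centers,
                                      ToDoubleBiFunction<double[], double[]> measure) {
        int best = -1;
        double min = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double current = measure.applyAsDouble(centers[i], point);
            if (current < min) {
                min = current;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{1.0, 0.0}, {0.0, 1.0}};
        double[] point = {0.9, 0.1};  // clearly belongs with centers[0]

        // Distance: the smallest value is the closest center, index 0.
        System.out.println(indexOfSmallest(point, centers, MinSelectionDemo::euclideanDistance));
        // Similarity: the smallest value is the LEAST similar center, index 1.
        System.out.println(indexOfSmallest(point, centers, MinSelectionDemo::cosineSimilarity));
    }
}
```

With the Euclidean distance the point is assigned to the first center, as expected. With the cosine similarity the same loop assigns it to the second center, which is the one it resembles least.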

In conclusion

If you really want to use `cosinesimilarity`, at present you have no choice but to write your own subclass. I considered sending a pull request that special-cases it, but gave up because the code got messy. I am not sure what the best way forward is.

Postscript

I filed this as an issue: https://github.com/deeplearning4j/deeplearning4j/issues/2361

Apparently it is going to be closed as "normal operation". Certainly it is normal operation, but it is not "correct processing". Rather than disabling the option or forcing it to work, I think it would be better to rename `cosinesimilarity` to `cosinedistance` and accept only `cosinedistance = 1 - cosinesimilarity`.
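As a rough illustration of that conversion, here is a minimal sketch in plain Java. The class `CosineDistanceDemo` and its methods are hypothetical names, not part of deeplearning4j, and the sketch assumes non-zero vectors.

```java
public class CosineDistanceDemo {

    public static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine distance = 1 - cosine similarity.
    // 0 means same direction, 1 means orthogonal, 2 means opposite direction,
    // so "smaller is closer" holds, as it does for euclidean and manhattan.
    public static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        double[] point = {0.9, 0.1};
        double[] alike = {1.0, 0.0};
        double[] unlike = {0.0, 1.0};

        System.out.println("distance to alike:  " + cosineDistance(point, alike));
        System.out.println("distance to unlike: " + cosineDistance(point, unlike));
    }
}
```

Because the converted value shrinks as the vectors become more alike, a loop that keeps the minimum would pick the most similar cluster instead of the least similar one.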
