Calculate the similarity score of strings with Java

You can do this out of the box with Apache Lucene, in a single line of code. Two common methods are the Levenshtein distance and the Jaro-Winkler distance (there are others as well).

Levenshtein distance method

The distance is the number of single-character edits needed to turn one string into the other. For example, turning "BitCoin Core" into "BitCoin Cash" takes the following edits:

- 1st edit: "BitCoin C[a]re"
- 2nd edit: "BitCoin Ca[s]e"
- 3rd edit: "BitCoin Cas[h]"

Therefore, the distance is "3".

In this case, both strings are 12 characters long, and 9 of the 12 characters need no editing. The score is therefore 9/12 = 3/4 = 0.75, or 75 points.
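To make the arithmetic above concrete, here is a minimal plain-Java sketch that counts the edits with the usual dynamic-programming table and then normalizes by the length of the longer string. The class and method names (`LevenshteinSketch`, `distance`, `score`) are hypothetical and are not part of Lucene. The normalization is an assumption chosen to match the 9/12 calculation above.

```java
// Illustrative sketch only: hypothetical names, not Lucene's API.
class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1 - distance / length of the longer string. */
    static double score(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("BitCoin Core", "BitCoin Cash")); // prints 3
        System.out.println(score("BitCoin Core", "BitCoin Cash"));    // prints 0.75
    }
}
```

Only two rows of the table are kept at a time, so memory use grows with the length of one string rather than with the product of both lengths.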

**It is generally said to be well suited to spell checking and plagiarism checking.**

Jaro Winkler distance method

This method also measures similarity. Roughly speaking, a character counts as a match if the same character appears in the other string within a certain range of positions, and the similarity is calculated from those matches.

In addition, how well the prefixes match is also taken into account when calculating the similarity.

- In the case of "1234567890" and "0004567890", the score is about 80 points.
- In the case of "1234567890" and "1234567111", the score is about 94 points.
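The two ideas above (matches within a position window, plus a bonus for a shared prefix) can be sketched in plain Java as follows. This is not Lucene's source code, and the names (`JaroWinklerSketch`, `jaro`, `jaroWinkler`) are hypothetical. The 0.7 threshold, the scaling factor `min(0.1, 1 / maxLength)`, and the uncapped prefix length are assumptions, chosen because they reproduce the scores quoted in this article. The textbook formulation caps the prefix at four characters and would give lower scores for long shared prefixes.

```java
// Illustrative sketch only: hypothetical names and assumed constants, not Lucene's API.
class JaroWinklerSketch {

    /** Plain Jaro similarity in the range 0 to 1. */
    static double jaro(String s1, String s2) {
        int len1 = s1.length();
        int len2 = s2.length();
        // A character may match the same character at most `range` positions away.
        int range = Math.max(Math.max(len1, len2) / 2 - 1, 0);
        boolean[] used1 = new boolean[len1];
        boolean[] used2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int from = Math.max(0, i - range);
            int to = Math.min(len2, i + range + 1);
            for (int j = from; j < to; j++) {
                if (!used2[j] && s1.charAt(i) == s2.charAt(j)) {
                    used1[i] = true;
                    used2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk the matched characters of both strings in order and count mismatched pairs.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!used1[i]) {
                continue;
            }
            while (!used2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2; // each transposition involves two characters
        return (m / len1 + m / len2 + (m - t) / m) / 3.0;
    }

    /** Jaro similarity plus a bonus for the common prefix (assumed variant, see lead-in). */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        if (j < 0.7) { // assumed threshold below which no prefix bonus is applied
            return j;
        }
        int minLen = Math.min(s1.length(), s2.length());
        int maxLen = Math.max(s1.length(), s2.length());
        int prefix = 0;
        while (prefix < minLen && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        double scale = Math.min(0.1, 1.0 / maxLen); // assumed scaling factor
        return j + scale * prefix * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("1234567890", "0004567890")); // roughly 0.80
        System.out.println(jaroWinkler("1234567890", "1234567111")); // roughly 0.94
    }
}
```

In the first pair only seven characters match and there is no shared prefix, so the result is the plain Jaro value. In the second pair the same seven characters match, but they form a seven-character prefix, which lifts the score.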

**It is generally said to be effective for detecting spelling mistakes.**

Implementation

The only dependency is Lucene, added via Maven.

pom.xml


        <dependency>
            <artifactId>lucene-core</artifactId>
            <groupId>org.apache.lucene</groupId>
            <version>5.1.0</version>
        </dependency>
        <dependency>
            <artifactId>lucene-analyzers</artifactId>
            <groupId>org.apache.lucene</groupId>
            <version>3.6.1</version>
        </dependency>
        <dependency>
            <artifactId>lucene-spellchecker</artifactId>
            <groupId>org.apache.lucene</groupId>
            <version>3.6.1</version>
        </dependency>

sample


import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LevensteinDistance;

/**
 * Sample that calculates the similarity score of two strings.
 * @author ryutaro_hakozaki
 */
public class ExecStringSimilaritySample {
    
    public static void main(String argv[]){
        
        // Quotes inside the string literals must be escaped, otherwise the code does not compile
        System.out.println(
                "A score comparing \"BitCoin Core\" and \"BitCoin Cash\" at the Levenshtein distance== "
                        + getSimilarScoreByLevenshteinDistance("BitCoin Core", "BitCoin Cash"));

        System.out.println(
                "Jaro Winkler Distance score comparing \"BitCoin Core\" and \"BitCoin Cash\"== "
                        + getSimilarScoreByJaroWinklerDistance("BitCoin Core", "BitCoin Cash"));

        
    }
    
    /**
     * Scores the similarity of two strings using the Levenshtein distance.
     * @param s1 first string
     * @param s2 second string
     * @return similarity score from 0 to 100
     */
    private static int getSimilarScoreByLevenshteinDistance(String s1, String s2){
        
        //Input check is omitted
        LevensteinDistance dis =  new LevensteinDistance();
        return (int) (dis.getDistance(s1, s2) * 100);
    }
    
    /**
     * Scores the similarity of two strings using the Jaro-Winkler distance.
     * @param s1 first string
     * @param s2 second string
     * @return similarity score from 0 to 100
     */
    private static int getSimilarScoreByJaroWinklerDistance(String s1, String s2){
        
        //Input check is omitted
        JaroWinklerDistance dis =  new JaroWinklerDistance();
        return (int) (dis.getDistance(s1, s2) * 100);
    }
    
}

Execution result


A score comparing "BitCoin Core" and "BitCoin Cash" at the Levenshtein distance== 75
Jaro Winkler Distance score comparing "BitCoin Core" and "BitCoin Cash"== 95
