You can do this right away with Apache Lucene; the call itself is a single line. Two common measures are the Levenshtein distance and the Jaro-Winkler distance (there are others as well).
With the Levenshtein distance, the distance is the number of single-character edits needed to turn one string into the other. For example, when turning "BitCoin Core" into "BitCoin Cash":
Edit 1: "BitCoin C[a]re", edit 2: "BitCoin Ca[s]e", edit 3: "BitCoin Cas[h]"
Therefore, the distance is 3.
Both strings are 12 characters long, and 9 of those 12 characters need no editing. The score is 9/12 = 3/4 = 0.75, that is, 75 points.
**Generally, it is said to be easy to use for spell checking and plagiarism checking.**
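The edit counting above can be sketched in plain Java, without Lucene. This is a minimal illustration that assumes the usual dynamic-programming formulation and normalizes by the length of the longer string, which is what the 9/12 calculation above amounts to. The class and method names (`LevenshteinSketch`, `distance`, `similarity`) are hypothetical and are not part of Lucene.

```java
// Hypothetical helper for illustration only; not a Lucene class.
class LevenshteinSketch {

    /** Minimum number of single-character insertions, deletions or substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in the range 0..1: 1 - distance / length of the longer string. */
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longer;
    }

    public static void main(String[] args) {
        System.out.println(distance("BitCoin Core", "BitCoin Cash"));   // prints 3
        System.out.println(similarity("BitCoin Core", "BitCoin Cash")); // prints 0.75
    }
}
```

Multiplying the similarity by 100 gives the 75-point score from the worked example.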
The Jaro-Winkler distance also measures similarity. A character counts as a match if the same character appears in the other string within a certain range of positions, so characters that are only slightly out of place still contribute to the score.
In addition, how long a prefix the two strings share is taken into account when calculating the similarity.
- In the case of "1234567890" and "0004567890", the score is about 80 points.
- In the case of "1234567890" and "1234567111", the score is about 94 points.
**Generally, it is said to be effective for checking spelling mistakes.**
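To make the two ideas above concrete (matching within a window, plus a bonus for a shared prefix), here is a rough sketch in plain Java. It is not Lucene's source. The names (`JaroWinklerSketch`, `jaro`, `jaroWinkler`, `BOOST_THRESHOLD`) are hypothetical, and the prefix weighting is an assumption chosen so that the numbers line up with the scores quoted above: the prefix is not capped, and the weight is min(0.1, 1 / longer length). Textbook Jaro-Winkler caps the prefix at 4 characters with a weight of 0.1, which gives lower scores for long shared prefixes.

```java
// Hypothetical helper for illustration only; not a Lucene class.
class JaroWinklerSketch {

    // Assumption: the prefix bonus is applied only when the Jaro score reaches this value.
    static final double BOOST_THRESHOLD = 0.7;

    /** Jaro similarity in the range 0..1. */
    static double jaro(String s1, String s2) {
        // Characters may match only if their positions differ by at most this much.
        int window = Math.max(Math.max(s1.length(), s2.length()) / 2 - 1, 0);
        boolean[] used1 = new boolean[s1.length()];
        boolean[] used2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(i - window, 0);
            int to = Math.min(i + window + 1, s2.length());
            for (int j = from; j < to; j++) {
                if (!used2[j] && s1.charAt(i) == s2.charAt(j)) {
                    used1[i] = true;
                    used2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk the matched characters of both strings in order and count mismatches.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!used1[i]) {
                continue;
            }
            while (!used2[j]) {
                j++;
            }
            if (s1.charAt(i) != s2.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // transpositions: half the out-of-order matches
        return (m / s1.length() + m / s2.length() + (m - t) / m) / 3.0;
    }

    /** Jaro score plus a bonus for the shared prefix (uncapped variant, see lead-in). */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        if (j < BOOST_THRESHOLD) {
            return j;
        }
        int minLen = Math.min(s1.length(), s2.length());
        int maxLen = Math.max(s1.length(), s2.length());
        int prefix = 0;
        while (prefix < minLen && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        double weight = Math.min(0.1, 1.0 / maxLen);
        return j + weight * prefix * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("1234567890", "0004567890")); // roughly 0.80
        System.out.println(jaroWinkler("1234567890", "1234567111")); // roughly 0.94
    }
}
```

Both example pairs have the same seven matching characters, so their plain Jaro score is the same (0.8). The second pair scores higher only because it shares a seven-character prefix, which is the effect the prefix term is meant to capture.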
The only dependency is Lucene. With Maven:
pom.xml
<dependency>
<artifactId>lucene-core</artifactId>
<groupId>org.apache.lucene</groupId>
<version>5.1.0</version>
</dependency>
<dependency>
<artifactId>lucene-analyzers</artifactId>
<groupId>org.apache.lucene</groupId>
<version>3.6.1</version>
</dependency>
<dependency>
<artifactId>lucene-spellchecker</artifactId>
<groupId>org.apache.lucene</groupId>
<version>3.6.1</version>
</dependency>
sample
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.LevensteinDistance;

/**
 * Sample that calculates the similarity score of two character strings.
 * @author ryutaro_hakozaki
 */
public class ExecStringSimilaritySample {

    public static void main(String[] args) {
        // The inner double quotes must be escaped, otherwise the literals do not compile.
        System.out.println(
            "A score comparing \"BitCoin Core\" and \"BitCoin Cash\" at the Levenshtein distance== "
            + getSimilarScoreByLevenshteinDistance("BitCoin Core", "BitCoin Cash"));
        System.out.println(
            "Jaro Winkler Distance score comparing \"BitCoin Core\" and \"BitCoin Cash\"== "
            + getSimilarScoreByJaroWinklerDistance("BitCoin Core", "BitCoin Cash"));
    }

    /**
     * Determines the similarity of two strings by Levenshtein distance.
     * @param s1 first string to compare
     * @param s2 second string to compare
     * @return similarity score from 0 to 100 (fractional part truncated)
     */
    private static int getSimilarScoreByLevenshteinDistance(String s1, String s2) {
        // Input check is omitted
        // Note: the Lucene class name really is spelled "LevensteinDistance".
        LevensteinDistance dis = new LevensteinDistance();
        return (int) (dis.getDistance(s1, s2) * 100);
    }

    /**
     * Determines the similarity of two strings by Jaro-Winkler distance.
     * @param s1 first string to compare
     * @param s2 second string to compare
     * @return similarity score from 0 to 100 (fractional part truncated)
     */
    private static int getSimilarScoreByJaroWinklerDistance(String s1, String s2) {
        // Input check is omitted
        JaroWinklerDistance dis = new JaroWinklerDistance();
        return (int) (dis.getDistance(s1, s2) * 100);
    }
}
Execution result
A score comparing "BitCoin Core" and "BitCoin Cash" at the Levenshtein distance== 75
Jaro Winkler Distance score comparing "BitCoin Core" and "BitCoin Cash"== 95