[PYTHON] Finding mystery changes in Pokédex descriptions with the Levenshtein distance

Introduction

It is common knowledge among trainers who have loved Pokémon for a long time, but Pokédex descriptions differ depending on the game title.

Pikachu

Pikachu: Red / Green It has a small cheek on the cheek. When you're in a pinch

Pikachu: Blue It is said that if there is a lot of babies, there will be a lot of sickness and sickness.

In this way, one of the attractions of Pokemon is that even if it is the same Pokemon, new discoveries can be made by playing it with a different game title.


On the other hand, some of the differences between titles look like this.

Bulbasaur

Bulbasaur: Red / Green **Since I was born, there have been a lot of seeds in the plant, and it's a little big.**

Bulbasaur: Blue **It is said that the body has a mysterious seed in it since it was born.**

The text has changed a little, but the content is almost the same, only the wording is different. Thank you very much.


In this article, descriptions whose content is almost the same but whose text differs subtly are called **mystery changes**, and we will look for them efficiently.

Levenshtein distance (editing distance)

As of May 2020 there are 890 species of Pokémon. That is already a lot, and there are also (probably) 33 game titles.

'Red / Green', 'Blue', 'Pikachu', 'Gold', 'Silver', 'Crystal', 'Ruby', 'Sapphire', 'Fire Red', 'Leaf Green', 'Emerald', 'Diamond', 'Pearl', 'Platinum', 'Heart Gold', 'Soul Silver', 'Black', 'White', 'Black 2 / White 2', 'X', 'Y', 'Omega Ruby', 'Alpha Sapphire', 'Sun', 'Moon', 'Ultra Sun', 'Ultra Moon', "Let's Go! Pikachu / Let's Go! Eevee", 'Sword', 'Shield'

Newer Pokémon have only about two variants of Pokédex description, but older Pokémon have many more, and the cumulative total is **about 15,000 sentences**.

Checking these for every Pokémon and every title is a lot of work: **even if one Pokémon appears in only 20 titles, there are 20 * 19 / 2 = 190 pairs to compare.** Doing this by hand for all Pokémon is impractical, so we use the **Levenshtein distance**, a measure of how different two strings are.
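The pair count can be checked with a short sketch. The four-title list is a hypothetical subset used only for illustration.

```python
from itertools import combinations
from math import comb

# Hypothetical subset of titles, only for illustration
titles = ["Red / Green", "Blue", "Pikachu", "Gold"]

# Every unordered pair of titles is compared exactly once
pairs = list(combinations(titles, 2))
print(len(pairs))   # 4 * 3 / 2 = 6 pairs

# With 20 titles the count is 20 * 19 / 2
print(comb(20, 2))  # 190
```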

This measure expresses, as a distance, how many times the following operations must be applied to convert one string into the other:

  1. **Delete a character**
  2. **Insert a character**
  3. **Replace a character with another**

It is a simple idea.

For example, take the Japanese names of

  1. Pikachu (ピカチュウ)
  2. Raichu (ライチュウ)

Replacing the first two characters, **ピカ**チュウ → **ライ**チュウ, turns one name into the other, so the Levenshtein distance is **2**.

  1. Rhyhorn (サイホーン)
  2. Rhydon (サイドン)

In this case, changing the "ホ" in サイ**ホ**ーン to "ド" and deleting the long-vowel mark "ー" gives サイ**ド**ン, so the Levenshtein distance is also **2**.
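The two examples can be checked with a compact dynamic-programming sketch. It compares the Japanese names (ピカチュウ / ライチュウ and サイホーン / サイドン), because the distances of 2 refer to those spellings rather than to the English names. The function name `edit_distance` is hypothetical, and this is not the article's own implementation.

```python
def edit_distance(s, t):
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]

print(edit_distance("ピカチュウ", "ライチュウ"))  # 2
print(edit_distance("サイホーン", "サイドン"))    # 2
```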


Implementation

I implemented this with reference to the Levenshtein distance algorithm described in the article below, and added my own processing that traces back the shortest sequence of edits.

**Reference**: [Technical explanation] You can see similar character strings! How to calculate the Levenshtein distance and the Jaro Winkler distance https://mieruca-ai.com/ai/levenshtein_jaro-winkler_distance/

import re

import pandas as pd
import numpy as np

def LevenshteinDistance(s, t, verbose = False):
    
    def preprocessing(text):
        #Strip HTML tags and unify half-width / full-width characters
        re_html = r'(<.+?>)'
        text = text.replace(" ","_").replace("\u3000","_").replace("\uff01","!")
        text = re.sub(re_html,"",text)
        return text
    
    #Unify half-width and full-width spaces
    s = preprocessing(s)
    t = preprocessing(t)
    #Allow the shortest distance to be calculated back
    back_trace_dict={0:"UP",
                    1:"LEFT",
                    2:"DIAG"}

    #Create a 2-D array of shape (len(s)+1, len(t)+1)
    dis_array = np.zeros((len(s)+1,len(t)+1),dtype="int")
    dis_array[:,0] = np.arange(len(s)+1)
    dis_array[0,:] = np.arange(len(t)+1)
    
    #For the traceback: the first column is reached only by deletions (UP),
    #the first row only by insertions (LEFT)
    back_trace = np.full(dis_array.shape,"DIAG",dtype="<U4")
    back_trace[1:,0] = "UP"
    back_trace[0,1:] = "LEFT"

    
    for i in range(1,len(s)+1):
        for j in range(1,len(t)+1):
            d1 = dis_array[i-1,j] + 1 #Delete
            d2 = dis_array[i,j-1] + 1 #Insert
            
            #Replace if the characters are different. If they are the same, leave it as it is
            d3 = dis_array[i-1,j-1] + (0 if s[i-1] == t[j-1] else 1)
            
            #Use the one with the shortest distance among delete / insert / replace
            dis_array[i,j] = min(d1,d2,d3)
            back_trace[i,j] = back_trace_dict[np.argmin([d1,d2,d3])]
    

    #Trace back the edit operations that give the minimum edit distance
    if verbose:
        def getchar(text,idx):
            return "_" if idx <0 else text[idx]
        
        #Pad both strings so that their indices line up with dis_array
        s = " " + s
        t = " " + t      
        s_,t_ = len(s)-1,len(t)-1
        s_list = []
        t_list = []

        trace_list = []
        while s_ >= 0 and t_ >= 0:
            trace = back_trace[s_,t_]
            trace_list.append(trace)
            if trace == "DIAG":
                if s[s_] != t[t_]:
                    wrapper = "x"
                else:
                    wrapper = ""
                s_list.append(getchar(s,s_) + wrapper)
                t_list.append(getchar(t,t_) + wrapper)
                s_ -= 1
                t_ -= 1
                
            elif trace == "LEFT":
                s_list.append("X")
                t_list.append(getchar(t,t_))
                t_ -= 1
                
            else:
                s_list.append(getchar(s,s_))
                t_list.append("X")
                s_ -= 1            
        
        #Print the aligned strings; the distance itself is still returned below
        print("".join(s_list[::-1]) + "\n" + "".join(t_list[::-1]))
    
    return dis_array[-1,-1]

Calculation result

str1 = "For a while after I was born, I got a great deal from the seeds of the middle."
str2 = "For a while after being born, I picked up the seeds that were stuck in the seeds."

lv_dist = LevenshteinDistance(str1,str2,verbose=True)
print(f"Levenshtein distance:{lv_dist}")

'''
output:
It's been _for a while _ for a while _ in the middle of _ x ne x or x et al.
I was born XX _ for a while _ X _ in the middle _ _ _ x _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Levenshtein distance:21

'''

Calculating for all Pokédex texts of all Pokémon

We calculate and compare the Levenshtein distance between the Pokédex descriptions of each Pokémon.

I prepared a file whose columns are the game titles and whose rows are the Pokémon.
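As a rough illustration of that layout, the sketch below reads a hypothetical two-row sample. The column names `num` and `name` and the `__` placeholder follow the code in this article; the description texts are made up. Using `na_values` is one possible way to turn the placeholder into NaN while reading, not necessarily how the original file was loaded.

```python
import io

import pandas as pd

# Hypothetical two-row sample; "__" marks a title with no description
csv_text = """num,name,Red / Green,Blue,Sword
1,Bulbasaur,text A,text B,__
25,Pikachu,text C,text D,text C
"""

# na_values turns the "__" placeholder into NaN while reading
sample = pd.read_csv(io.StringIO(csv_text), na_values=["__"])

# One Pokémon's row, keeping only the titles that have a description
row = sample.query("num == 1").iloc[0].dropna()
titles = list(row.index[2:])  # skip the num and name columns
print(titles)
```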

First, we compare the Pokédex descriptions of a single Pokémon across titles.

Implementation

import matplotlib.pyplot as plt
import japanize_matplotlib
import seaborn as sns
from itertools import combinations,product
import os

#Cells containing "__" mean that the title has no description
df = pd.read_csv("Picture book text.csv",encoding="utf-8-sig").replace("__",np.nan)

def compare_pokedex(poke_num,threshold=30):
    
    row = df.query("num == @poke_num").iloc[0].dropna()
    
    text_array=row.values[2:].copy()  #Exclude the num and name columns
    indices = range(len(text_array))

    #Create a combination of all titles
    comb = list(product(indices,indices))
    col1_list =[]
    col2_list = []
    lev_list = []

    for (col1,col2) in comb:
        if col1 > col2:
            lev_dist=LevenshteinDistance(text_array[col1],text_array[col2])
            col1_list.append(col1)
            col2_list.append(col2)
            lev_list.append(lev_dist)

            col1_list.append(col2)
            col2_list.append(col1)
            lev_list.append(lev_dist)

        elif col1 == col2:
            col1_list.append(col1)
            col2_list.append(col2)
            lev_list.append(0)

    lev_df = pd.DataFrame({"col1":col1_list,
                          "col2":col2_list,
                          "lev":lev_list})
    lev_pivot=pd.pivot_table(index="col1",columns="col2",values="lev",data=lev_df)

    lev_pivot.index=row[2:].index
    lev_pivot.columns=row[2:].index
    
    sns.clustermap(lev_pivot, method='ward', metric='euclidean',annot=True)
    
    lev_stack=pd.DataFrame(lev_pivot.stack().sort_values())
    lev_stack.reset_index(inplace=True)
    lev_stack.columns = ["col1","col2","lev"]

    lev_stack["col1_text"] = lev_stack["col1"].map(lambda x:row[x])
    lev_stack["col2_text"] = lev_stack["col2"].map(lambda x:row[x])
    lev_stack = lev_stack.query("col1_text != col2_text")
    lev_stack = lev_stack.drop_duplicates(subset=["col1_text","col2_text"])
    
    #Display only pairs whose distance is below the threshold
    similar_df = lev_stack.query("lev > 0 and lev < @threshold and col1 > col2")
    for i,sim in similar_df.iterrows():
        dist = sim["lev"]
        idx1 = sim["col1"]
        idx2 = sim["col2"]

        print("Levenshtein distance:{}".format(dist))
        print(",".join(list(row[row == row[idx1]].index)))
        print(sim["col1_text"])
        print(",".join(list(row[row == row[idx2]].index)))
        print(sim["col2_text"])
        print("-----------")

After computing the Levenshtein distance for every pair of titles, we use the seaborn clustermap to see which titles have similar text. A group filled in black and annotated with 0 means that exactly the same text is used.
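The pairwise table behind the cluster map can also be built directly as a symmetric matrix. The following is a minimal sketch, not the code used in this article: `distance_matrix` is a hypothetical helper, `texts` is assumed to be a dict mapping title to description, and `dist` is any two-argument distance function.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def distance_matrix(texts, dist):
    """Build a symmetric title-by-title distance table.

    texts: dict mapping title -> description (assumed input shape)
    dist:  any function dist(a, b) returning an integer distance
    """
    titles = list(texts)
    mat = np.zeros((len(titles), len(titles)), dtype=int)
    for i, j in combinations(range(len(titles)), 2):
        d = dist(texts[titles[i]], texts[titles[j]])
        mat[i, j] = mat[j, i] = d  # the distance is symmetric
    return pd.DataFrame(mat, index=titles, columns=titles)
```

Passing the `LevenshteinDistance` function from this article as `dist` should give a table of the same kind as `lev_pivot`, with zeros on the diagonal.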

How far is suspicious?

・ Short-distance text (Levenshtein distance: 19)

It's been _for a while _ for a while _ in the middle of _ x ne x or x et al.
I was born XX _ for a while _ X _ in the middle _ _ _ x _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The differences are a **difference in particles** such as "kara" and "ha", a **difference in notation** between "Tane" and "Seed", and a **difference in verbs** between "Take" and "Get".

The Levenshtein distance is as high as 19, but the two texts convey almost the same information; only the wording differs.

・ Middle-distance sentence example (Levenshtein distance: 30)

Born from xXX, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x Te_big_sodatsu.
From the time of birth x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Sodatsu.

The information looks almost the same as in the short-distance example, but whereas the first text says "I grow up with seeds on my back", the second adds the information "Get nutrition from seeds".

・ Long-distance sentence example (Levenshtein distance: 56)

_ X _ x _ x _ x _ x _ x _ x _ x _ x _ x _ x x x x x x x x _ x _ x _ x _ x _ x _ x _ x _ x _ x _ x _ x _ x _ x _ x _ That's x, x, x. x
x What x to x to x to x x x _ x to x to _ XX Eat x to x to x to x_x to x to x to x! x_x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x

At a Levenshtein distance of 56 the texts are almost entirely different. Only some nouns, particles, and overlapping spaces match.


After looking at various examples, it turned out that **mystery changes** in Pokédex descriptions can be found by searching for pairs with a **Levenshtein distance of 20 or less**.

Confirmation of Levenshtein distance aggregation result

This time, I took the Pokédex descriptions from the Pokémon wiki. https://wiki.xn--rckteqa2e.com/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

Average value for each generation

The average Levenshtein distance for each generation is as follows.

The Levenshtein distance for the Pokédex descriptions of the 3rd generation looks particularly large, but this simply reflects the length of the text in each generation.

Still, isn't "the Pokédex text of Pokémon that first appeared in the 3rd generation is long" a modest discovery in itself? I could not find any articles or social media accounts that mention it.
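Because longer texts naturally produce larger distances, one possible adjustment (not used in this article) is to divide the distance by the length of the longer text. The helper name `normalized_distance` is hypothetical, and the lengths should be measured after the same preprocessing that the distance calculation applies.

```python
def normalized_distance(distance, s, t):
    """Scale an edit distance to the range 0..1 by the longer text's length."""
    longest = max(len(s), len(t))
    return distance / longest if longest else 0.0

# Rhyhorn / Rhydon in Japanese: distance 2 over 5 characters
print(normalized_distance(2, "サイホーン", "サイドン"))  # 0.4
```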

When the Levenshtein distances of all pairs of Pokédex descriptions were tallied, the pairs with a distance of 20 or less, which are the candidates for **mystery changes**, made up about **0.47%** of the total.
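The tally can be sketched as follows. The distances in the example are made up, and whether identical pairs (distance 0) were counted in the total is an assumption of this sketch.

```python
def candidate_share(distances, threshold=20):
    """Return (count, share) of pairs with 0 < distance <= threshold."""
    candidates = [d for d in distances if 0 < d <= threshold]
    share = len(candidates) / len(distances) if distances else 0.0
    return len(candidates), share

# Hypothetical distances for six title pairs
count, share = candidate_share([0, 1, 19, 20, 30, 56])
print(count, f"{share:.1%}")  # 3 50.0%
```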

Some Pokédex descriptions that had mystery changes

Even at 0.47% there are 110 pairs, so I will introduce only a few.

Some pairs differed only in the presence or absence of a space, as in the following.

Rhydon, Levenshtein distance: 1
I'm protecting my brush with an armor-like brush. 2000 You can go in any magma. (Blue, Leaf Green, X)
I'm protecting my brush with an armor-like brush. 2000 You can go in any magma. (Shield)

It is hard to see, but the distance of 1 comes from the presence or absence of a space.

I do not own every Pokémon title, so it is hard to tell whether these space differences really exist in the games or are mistakes on the Pokémon wiki. **(If you own every title and have collected every Pokémon, please verify!)**
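Pairs like this could be separated from genuine wording changes with a check along these lines. It is a sketch with a hypothetical helper name, not something this article ran.

```python
def differs_only_by_whitespace(a, b):
    """True if a and b differ, but match once all whitespace is removed."""
    # str.split() with no arguments also splits on the full-width space U+3000
    return a != b and "".join(a.split()) == "".join(b.split())

print(differs_only_by_whitespace("ab cd", "abcd"))  # True
print(differs_only_by_whitespace("abcd", "abce"))   # False
```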


Golbat, distance 20
It's a rush Kiva and I bite and scoop up 300 Ccy at a time. (Red / Green, Fire Red)
If you insert Kiva into the food, you will get rid of 300 Ccy Ketsueki together. (Crystal)


Parasect, distance 19
Sprinkle the mushrooms from the mushrooms. But in China, this is Kanpoyaku. (Red / Green, Fire Red)
Sprinkle some mushrooms from the mushrooms. If you collect all the information, you will get Kanpo finally. (Moon)

Arcanine, distance 5
**China** Legendary Pokémon. There are many things that are captivated by the lightly lightly. (Pikachu)
Legendary Pokémon. There are many things that are captivated by the lightly lightly. (Let's Go! Pikachu / Let's Go! Eevee)

In titles released in recent years, the wording "China" has been removed. Is this out of consideration for expansion into China? (The real-world place names **India** and **Everest** have also appeared in descriptions; will they change in the future?)


Banette, distance 19
The stuffed toy has become a onnen and a yadori Pokemon. I'm looking for a child who has lost my life. (Diamond, Pearl, Platinum, Black, White, X)
Envy has accumulated on the stuffed toy and has become a Pokemon. I'm looking for a child who's gone. (Black 2 / White 2)

Plush toy → Nuigurumi. Onnen → envy. Yadori → Accumulate. "I'm looking for" → "I'm looking for". The distance is close to 20, yet this is a textbook example of a mystery change.


Rhydon, distance 7
I've been able to get away with just the back. If you get used to it with horns, you will have a hard time. (Red / Green, Fire Red, Y)
I've been able to get away with just the back. If you use horns, you will open up your ears. (Sword)

Up to Y the sentence was in the passive voice, and in Sword it became active. Why???


Clamperl, distance 1
When it's one or more, it's time to make a mysterious evolution that evolves psychopower. (Heart Gold, Soul Silver)
When it's one evolution at a time, it creates a mysterious evolution that evolves psychopower. (X)

"One time" → "1st floor"


Revising descriptions that used real place names such as "China" is understandable. Many of the mystery changes, however, look as if someone had been told "Please do not use the same password as last time" and changed the text only because a change was required; their intent is unknown.

(Probably) wrong hypothesis: to make the line breaks fall cleanly?

The width of the Pokédex text area differs from title to title, and the number of characters per line seems to differ with it.

--18 characters per line

--24 characters per line

Unfortunately most of my games are at my parents' house, so I could not verify this on the actual games. Instead, I searched online for descriptions that fill a line from end to end and counted the characters. Some titles, such as HGSS and BW, use half-width spaces.

| Generation | Characters per line (full-width) | Maximum characters (including line breaks) |
| --- | --- | --- |
| Red / Green / Blue | 18 | 56 |
| Gold / Silver | 18 | 56 |
| RSE | 24 | 74 |
| FRLG | 18 | 56 |
| DP | 18 | 56 |
| HGSS | 21 (variable width?) | 65? |
| BW | 18 (variable width?) | 56? |
| XY, ORAS | 24 | 74 |
| SM, USUM | 24 | 74 |
| Sword / Shield | 18 | 56 |
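The maximum character counts in the table are consistent with three lines of text plus two line breaks, although that layout is an assumption here rather than something verified on the games. The sketch below reproduces the arithmetic and shows one way to check whether a description fills the full line width; both helper names are hypothetical.

```python
def max_chars(width, lines=3):
    """Characters available with `lines` lines of `width`, counting line breaks."""
    return width * lines + (lines - 1)

def line_usage(description, width):
    """Return the longest line length and whether any line fills `width`."""
    longest = max(len(line) for line in description.split("\n"))
    return longest, longest >= width

print(max_chars(18), max_chars(21), max_chars(24))  # 56 65 74

# Hypothetical description whose first line is 18 characters long
text = "あ" * 18 + "\n" + "い" * 10
print(line_usage(text, 18))  # (18, True)
print(line_usage(text, 24))  # (18, False)
```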

XY, ORAS, SM, and USUM allow 24 characters per line, but the full 24 characters are actually used only in the Pokédex descriptions of the 3rd-generation RSE and in the text revised for the ORAS remakes.

Strange. All titles are almost the same ...

The mystery changes found earlier are pairs with a small Levenshtein distance between titles that have the same number of characters per line, so there would have been no need to adjust the line-break positions.

By the way, the titles that tend to share the same Pokédex description are as follows. (The number in front of each title is the number of characters per line; "18_Pikachu" means 18 characters per line in the Pikachu version.)

You can see that the five titles that exceptionally use the full 24 characters per line (Ruby, Sapphire, Emerald, Omega Ruby, Alpha Sapphire) cluster together.

In addition, from the 7th-generation Sun / Moon onward, the National Pokédex has no descriptions for Pokémon that cannot be obtained in the title in question. https://wiki.xn--rckteqa2e.com/wiki/%E3%83%9D%E3%82%B1%E3%83%A2%E3%83%B3%E3%81%9A%E3%81%8B%E3%82%93


Summary

It leaves things rather unresolved, but here is a summary.

-- Using the Levenshtein distance, we were able to find mystery changes efficiently.
-- The descriptions of Pokémon that first appeared in Ruby / Sapphire / Emerald are longer than those of other titles.
-- Most titles allow either 18 or 24 characters per line, but only RSE (Ruby / Sapphire / Emerald) uses the full 24 characters.
-- Descriptions that used real place names such as "China" were confirmed to have been changed to other expressions.
-- In conclusion, the reason for the mystery changes remains unclear.

Unverified hypothesis about the reason for the mystery change

-- An effect of the introduction of kanji from the 5th-generation BW onward.
-- Consideration for overseas expansion. (The Pokédex is written so that the amount of information does not grow when it is translated into English and other languages.)
-- **Changed on the whim of the person in charge to make it easier to read.**
