Name identification using python

Means of name identification

Use the Levenshtein distance. The "Levenshtein distance" is also known as the "edit distance". It is the minimum number of steps required to transform one character string into another, where each step is the "insertion", "deletion", or "substitution" of a single character.

For example, to turn cat into cut, it is tempting to count two steps, cat → ct → cut ("delete a", then "insert u"). However, because substitution is allowed, a can be replaced by u in a single step, so the Levenshtein distance is 1.
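To make the definition concrete, here is a minimal pure-Python sketch of the standard dynamic-programming computation. The function name `levenshtein` is made up for illustration and is not part of the `Levenshtein` package used in the rest of this post; the sketch assumes every insertion, deletion, and substitution costs 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # prev is the previous DP row: prev[j] is the distance between
    # the first i-1 characters of a and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("cat", "cut"))        # 1
print(levenshtein("oniku_a", "oniku"))  # 2
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of `b`.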

For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3%E8%B7%9D%E9%9B%A2). In this tutorial, we run a few experiments and then build the final name identification program.

Experiment 0. Let's try the "Levenshtein distance"

test0


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome']

for val in target:
    nearL_flag = False
    for cate in cates:
        # A distance below 1 means val exactly matches an existing category; anything that differs at all is appended to cates below.
        if Levenshtein.distance(val, cate) < 1:
            nearL_flag = True
    if not nearL_flag:
        cates.append(val)
cates

output


['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']

This program adds to cates every name whose Levenshtein distance from all existing categories is 1 or more. We will build on it from here. First, gradually increase the distance threshold and observe what happens to the output.

Experiment 1. Increase the Levenshtein distance threshold (tolerate ambiguity)

test1


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']

for i in range(10):
    cates = ['kome']
    for val in target:
        nearL_flag = False
        for cate in cates:
            if Levenshtein.distance(val, cate) < i:
                nearL_flag = True
        if not nearL_flag:
            cates.append(val)
    print(i, cates)

output



0 ['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
1 ['kome', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
2 ['kome', 'oniku_a', 'yasai_a']
3 ['kome', 'oniku_a', 'yasai_a']
4 ['kome', 'oniku_a', 'yasai_a']
5 ['kome', 'oniku_a', 'yasai_a']
6 ['kome', 'oniku_a', 'yasai_b']
7 ['kome', 'yasai_a']
8 ['kome']
9 ['kome']

From this output, we can see that more ambiguity is tolerated as the Levenshtein distance threshold increases. Take the question of whether 'oniku_a' and 'oniku_b' belong to the same category. With a small threshold they are "considered different"; with a large threshold the verdict becomes "well, let's treat them as the same".

The output above also suggests that it would be better to set the initial list of categories to ['kome','oniku','yasai'] instead of ['kome'].
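The threshold-based grouping used in these experiments can also be packaged as a small function, which makes it easy to try different seeds and thresholds. This is only a sketch: `build_categories` and `edit_distance` are names made up here, and `edit_distance` is a pure-Python stand-in for `Levenshtein.distance` so that the block runs without the third-party package.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Stand-in for Levenshtein.distance (assumption: unit-cost edits)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # Distance between the first i characters of a and the first j of b.
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))


def build_categories(names, seeds, threshold):
    """Add a name as a new category unless it is within threshold of an existing one."""
    cates = list(seeds)
    for name in names:
        if not any(edit_distance(name, cate) < threshold for cate in cates):
            cates.append(name)
    return cates


target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
print(build_categories(target, ['kome'], 2))  # ['kome', 'oniku_a', 'yasai_a']
print(build_categories(target, ['kome'], 8))  # ['kome']
```

Using `any()` stops comparing as soon as one close category is found, which gives the same result as the flag variable in the experiments.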

Experiment 2. Change the initial value to ['kome','oniku','yasai'] and try to identify the name.

test2


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']

for i in range(10):
    cates = ['kome', 'oniku', 'yasai']
    for val in target:
        nearL_flag = False
        for cate in cates:
            if Levenshtein.distance(val, cate) < i:
                nearL_flag = True
        if not nearL_flag:
            cates.append(val)
    print(i, cates)

output


0 ['kome', 'oniku', 'yasai', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
1 ['kome', 'oniku', 'yasai', 'oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
2 ['kome', 'oniku', 'yasai', 'oniku_a', 'yasai_a']
3 ['kome', 'oniku', 'yasai']
4 ['kome', 'oniku', 'yasai']
5 ['kome', 'oniku', 'yasai']
6 ['kome', 'oniku', 'yasai']
7 ['kome', 'oniku', 'yasai']
8 ['kome', 'oniku', 'yasai']
9 ['kome', 'oniku', 'yasai']

From the above, for this target and the initial value cates = ['kome','oniku','yasai'], we find that the names can be identified when the Levenshtein distance threshold is 3, that is, when names at a distance of less than 3 are treated as the same.

Finally, here is the name identification program.

Name identification program

nayose


import Levenshtein

target = ['oniku_a', 'oniku_b', 'oniku_c', 'yasai_a', 'yasai_b', 'yasai_c']
cates = ['kome', 'oniku', 'yasai']

nayose = []

for val in target:
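    minL = float('inf')      # smallest distance found so far for this name
    afterNayose = 'dummy'    # placeholder, overwritten by the closest category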
    for cate in cates:
        tmp_distance = Levenshtein.distance(val, cate)
        if tmp_distance < minL:
            minL = tmp_distance
            afterNayose = cate
    nayose.append(afterNayose)

[(before, after, Levenshtein.distance(before, after)) for before, after in zip(target, nayose)]

output


[('oniku_a', 'oniku', 2),
 ('oniku_b', 'oniku', 2),
 ('oniku_c', 'oniku', 2),
 ('yasai_a', 'yasai', 2),
 ('yasai_b', 'yasai', 2),
 ('yasai_c', 'yasai', 2)]

Each tuple in the output reads (before, after, Levenshtein distance between before and after). You can confirm that every Levenshtein distance is less than 3 and that the names have been identified.
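One possible variation, shown only as a sketch, is to combine the nearest-category search with the "less than 3" cutoff found in the experiments, so that a name with no sufficiently close category is reported instead of being forced into one. The names `nearest_category`, `edit_distance`, and `max_distance`, as well as the input 'pan', are made up for illustration; `edit_distance` is a pure-Python stand-in for `Levenshtein.distance`.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Stand-in for Levenshtein.distance (assumption: unit-cost edits)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # Distance between the first i characters of a and the first j of b.
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))


def nearest_category(name, cates, max_distance=3):
    """Return (category, distance); category is None when the closest one is too far."""
    distances = [edit_distance(name, cate) for cate in cates]
    dist = min(distances)
    best = cates[distances.index(dist)]  # ties go to the earlier entry in cates
    return (best if dist < max_distance else None), dist


cates = ['kome', 'oniku', 'yasai']
for name in ['oniku_a', 'yasai_c', 'pan']:  # 'pan' is a made-up unmatched name
    print(name, nearest_category(name, cates))
# oniku_a ('oniku', 2)
# yasai_c ('yasai', 2)
# pan (None, 4)
```

Returning the distance alongside the category keeps the cutoff decision visible, which helps when tuning `max_distance` for a different set of names.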
