Predict gender from name using Gender API and Pykakasi in Python

Introduction

There are situations where you want to predict a person's gender from their name. For example, asking for gender on the registration form of a membership service lowers the conversion rate (CVR), so you drop the field and make up for it with a prediction.

There are several ways to predict gender from a name, such as training a machine-learning classifier or calling an external API. This article takes the second approach and uses Gender API from Python.

Gender API is an overseas service that appears to base its gender predictions on a huge amount of name data. There are several similar services, but this article uses Gender API.

Preparation

Gender API

First, create a Gender API account and obtain an API key (API_KEY). The free plan covers up to 500 names.
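
Rather than pasting the key into the source file, one option is to read it from an environment variable. The following is a minimal sketch under that assumption; the variable name GENDER_API_KEY and the function name are hypothetical choices for this example, not something Gender API prescribes.

```python
import os


def load_gender_api_key(env_var="GENDER_API_KEY"):
    """Return the API key stored in an environment variable.

    GENDER_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            "Set the {} environment variable to your Gender API key.".format(env_var)
        )
    return key
```

The returned value can then be used wherever the script expects the API key, which keeps the key out of version control.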

Obtaining pseudo personal information

Use Personal Generator to generate pseudo personal information. You can freely choose which items to output; because we also want to check the predictions against the correct answers, we output the serial number, full name, name (katakana), and gender. This time, I will try to predict the gender from the names of about 30 people.

Pykakasi

The value passed to the predictor is the first name (first_name), and whether it is given in kanji, katakana, hiragana, or romaji greatly affects accuracy. In my tests, converting the name to romaji gave the most accurate predictions, probably because Gender API is an overseas service. (The verification process is omitted.)

Therefore, the names need to be converted to romaji, which is the job of Pykakasi. For usage, refer to the developer's documentation (How to use pykakasi). Install it with the following commands.

pip install six semidbm
pip install pykakasi
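
For reference, here is a small sketch of a romaji conversion with Pykakasi. It assumes a release that provides the convert() method (pykakasi 2.x or later, as far as I know); the helper names join_hepburn and to_romaji are hypothetical and exist only for this example.

```python
def join_hepburn(items):
    """Join the Hepburn romanization of each converted chunk into one string."""
    return "".join(item["hepburn"] for item in items)


def to_romaji(text):
    """Convert Japanese text to Hepburn romaji using pykakasi's convert()."""
    import pykakasi  # imported lazily so join_hepburn works without the package

    kks = pykakasi.kakasi()
    # convert() returns a list of dicts, one per chunk of the input text.
    # Each dict includes the original text and several readings, among them
    # the Hepburn romanization under the key 'hepburn'.
    return join_hepburn(kks.convert(text))


if __name__ == "__main__":
    # Requires pykakasi to be installed.
    print(to_romaji("タロウ"))
```

Splitting the joining step from the library call keeps the string handling easy to check on its own.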

Gender prediction

Gender prediction with Python

Now let's predict the gender of the 30 subjects. The general procedure is as follows.

  1. Load the target people into a dataframe, split each name on the full-width space, and generate a first-name column
  2. Create a Pykakasi instance, configure it to convert to romaji, and convert each first name into a romaji string
  3. Pass the list of romaji names to the Gender API and get the prediction results
  4. Merge the prediction results with the original dataframe
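
The four steps above can be sketched without pandas as follows. This is only an illustration of the data flow, not the script used in this article: to_romaji and predict are injected callables standing in for the Pykakasi conversion and the Gender API call, and all function names are hypothetical.

```python
def extract_first_name(full_name):
    """Return the given name from a 'FAMILY GIVEN' string.

    str.split() with no argument splits on any whitespace, which includes
    the full-width space (U+3000) used in Japanese names.
    """
    parts = full_name.split()
    if len(parts) < 2:
        return ""  # no separator found, so nothing reliable to extract
    return parts[1]


def predict_genders(members, to_romaji, predict):
    """Run steps 1 to 4 on a list of member dicts and return merged rows.

    to_romaji(text) -> str and predict(romaji) -> dict are supplied by the
    caller (hypothetical stand-ins for Pykakasi and the Gender API).
    """
    merged = []
    for member in members:
        first_name = extract_first_name(member["name_katakana"])  # step 1
        romaji = to_romaji(first_name)                             # step 2
        prediction = predict(romaji)                               # step 3
        row = dict(member, first_name=first_name, first_name_roma=romaji)
        row.update(prediction)                                     # step 4
        merged.append(row)
    return merged
```

Passing the converter and the predictor in as arguments makes it possible to try the flow with stub functions before spending any of the free API quota.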

gender_estimation.py


import json
from urllib import request, parse

import pandas as pd
import pykakasi


class GenderEstimation:
    """
    Predict gender from a romaji-converted first name.
    """
    __GENDER_API_BASE_URL = 'https://gender-api.com/get?'
    __API_KEY = "your api_key"  # replace with your own Gender API key

    def create_estimated_genders_date_frame(self):
        """Return the member data frame merged with the prediction results."""
        df = pd.DataFrame(self._estimate_gender())
        print('\nCompleted gender prediction for {} people.'.format(len(df)))
        df.columns = [
            'estimated_gender', 'accuracy', 'samples', 'duration'
        ]
        df1 = self._create_member_data_frame()
        estimated_genders_df = pd.merge(df1, df, left_index=True, right_index=True)
        
        return estimated_genders_df
    
    def _estimate_gender(self):
        unique_names = self._convert_first_name_to_romaji()
        genders = []
        print('Predicting the gender of {} people.'.format(len(unique_names)))
        for name in unique_names:
            res = request.urlopen(self._gender_api_endpoint(params={
                'name': name,
                'country': 'JP',
                'key': self.__API_KEY
            }))
            decoded = res.read().decode('utf-8')
            data = json.loads(decoded)
            genders.append(
                [data['gender'], data['accuracy'], data['samples'], data['duration']])
            
        return genders
    
    def _gender_api_endpoint(self, params):
        return '{base_url}{param_str}'.format(
            base_url=self.__GENDER_API_BASE_URL, param_str=parse.urlencode(params))
    
    def _convert_first_name_to_romaji(self):
        df = self._create_member_data_frame()
        df['first_name_roma'] = df['first_name'].apply(
            lambda x: self._set_kakasi(x))
        
        return df['first_name_roma']
    
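    def _set_kakasi(self, x):
        # Note: setMode()/getConverter() is pykakasi's older interface; newer
        # releases appear to favor kakasi().convert(), so check your version.
        kakasi = pykakasi.kakasi()
        kakasi.setMode('H', 'a')        # hiragana -> romaji
        kakasi.setMode('K', 'a')        # katakana -> romaji
        kakasi.setMode('J', 'a')        # kanji -> romaji
        kakasi.setMode('r', 'Hepburn')  # use Hepburn romanization
        kakasi.setMode('s', False)      # do not insert spaces between words
        kakasi.setMode('C', False)      # do not capitalize the result

        return kakasi.getConverter().do(x)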

    def _create_member_data_frame(self):
        df = pd.read_csv('personal_infomation.csv').rename(columns={
            'Serial number':'row_num',
            'Full name':'name',
            'Name (Katakana)':'name_katakana',
            'sex':'gender'
        })
        # Names are "FAMILY GIVEN"; split on whitespace and keep the given name.
        df['first_name'] = df.name_katakana.str.split().str[1]
        print('Extracted {} people to predict.'.format(len(df)))
        return df
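
As a complement to the class above, here is a hedged sketch of the request and response handling on its own, using only the standard library. The endpoint and the field names (gender, accuracy, samples, duration) are the ones the class already uses; the function names, the timeout value, and the idea of calling the API only once per distinct name (to save the free quota) are my own assumptions for illustration.

```python
import json
from urllib import parse, request

BASE_URL = "https://gender-api.com/get?"  # same endpoint as in the class above


def build_url(name, api_key, country="JP"):
    """Build the request URL for one romaji name."""
    query = parse.urlencode({"name": name, "country": country, "key": api_key})
    return BASE_URL + query


def parse_prediction(body):
    """Extract the four fields used in this article from a JSON response body."""
    data = json.loads(body)
    return {
        "estimated_gender": data.get("gender", "unknown"),
        "accuracy": data.get("accuracy"),
        "samples": data.get("samples"),
        "duration": data.get("duration"),
    }


def fetch_prediction(name, api_key, timeout=10):
    """Call the API for one name (needs network access and a valid key)."""
    with request.urlopen(build_url(name, api_key), timeout=timeout) as res:
        return parse_prediction(res.read().decode("utf-8"))


def predict_unique(names, fetch):
    """Call fetch once per distinct name and reuse the result for repeats."""
    cache = {}
    results = []
    for name in names:
        if name not in cache:
            cache[name] = fetch(name)
        results.append(cache[name])
    return results
```

For example, predict_unique(names, lambda n: fetch_prediction(n, api_key)) returns one result per input name, in order, while sending a request only for names it has not seen yet.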

Gender prediction results

Each row of the resulting data frame carries the original member columns plus four prediction columns. These columns come from the Gender API response and are defined as follows.

Column            Meaning
estimated_gender  Predicted gender
accuracy          Reliability of the prediction
samples           Sample size used for the prediction
duration          Elapsed time for one call
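
One way to use these fields is to accept a prediction only when it looks reliable enough. The sketch below assumes that accuracy is a percentage from 0 to 100 and that samples is a count; the thresholds (80 and 10) and the function name are hypothetical values picked for illustration, not recommendations from Gender API.

```python
def accept_prediction(prediction, min_accuracy=80, min_samples=10):
    """Return the predicted gender, or 'unknown' when the evidence is weak.

    prediction is a dict with the keys estimated_gender, accuracy and samples.
    The default thresholds are arbitrary example values.
    """
    accuracy = prediction.get("accuracy") or 0
    samples = prediction.get("samples") or 0
    if accuracy < min_accuracy or samples < min_samples:
        return "unknown"
    return prediction.get("estimated_gender", "unknown")
```

Treating weak predictions as unknown trades coverage for precision, which may be preferable when a wrong guess is costly.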

Gender prediction accuracy verification

Finally, let's verify the accuracy of the gender predictions. Counting each combination of correct answer and prediction gives the first table below, and arranging those counts as a matrix gives the second. The correct answer rate was almost 100% (29 of 30). Only one case was mispredicted: a person who is actually female was predicted to be male. Names such as "Iori", which can be given to both men and women, seem to be difficult to predict after all.

Correct answer  Forecast  num
male            male      11
male            female    0
male            unknown   0
female          male      1
female          female    18
female          unknown   0
unknown         male      0
unknown         female    0
unknown         unknown   0

Correct answer \ Forecast  male  female  unknown  Correct answer rate
male                       11    0       0        100.00%
female                     1     18      0        94.74%
unknown                    0     0       0        0%
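
The counts and rates in these tables can be reproduced with a short standard-library sketch. The pairs below are rebuilt from the counts reported above (11, 1, and 18); the function names are hypothetical, and a label with no cases is reported as None rather than as a percentage.

```python
from collections import Counter

LABELS = ("male", "female", "unknown")


def correct_rate_by_label(counts):
    """Return, per correct label, the share of cases predicted correctly."""
    rates = {}
    for actual in LABELS:
        total = sum(counts[(actual, predicted)] for predicted in LABELS)
        rates[actual] = counts[(actual, actual)] / total if total else None
    return rates


def overall_accuracy(counts):
    """Return the share of all cases where prediction equals correct answer."""
    total = sum(counts.values())
    correct = sum(n for (actual, predicted), n in counts.items()
                  if actual == predicted)
    return correct / total if total else None


# (correct answer, forecast) pairs rebuilt from the counts in the table.
pairs = ([("male", "male")] * 11
         + [("female", "male")] * 1
         + [("female", "female")] * 18)
counts = Counter(pairs)
rates = correct_rate_by_label(counts)

for label in LABELS:
    rate = rates[label]
    print(label, "n/a" if rate is None else "{:.2%}".format(rate))
print("overall", "{:.2%}".format(overall_accuracy(counts)))
```

With real data, the pairs would come from the gender and estimated_gender columns of the merged data frame.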
