Predict the gender of Twitter users with machine learning

We had data on 20,000 Twitter users labeled with gender, so we used it to train models that predict a user's gender from the profile text. Ruby handles the text processing and Python handles the machine learning.

Conclusion first

Predicting gender from Twitter profile text with simple machine learning was only about 60% accurate.

The data used here is not Japanese, so results on Japanese profiles may differ, but I expect the accuracy there to be similarly modest. My reasoning is that the gender of a Twitter user is hard to determine from the profile even for a human reader.

Twitter user gender determination procedure

Ruby is used for steps 1 to 5, and Python is used for step 6.

  1. List the words that appear in the profiles
  2. Record the number of occurrences of each word
  3. Eliminate words that appear extremely rarely or too often
  4. Represent each user's profile as a vector with one dimension per remaining word
  5. Create the ground-truth labels
  6. Apply machine learning
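
Steps 1 to 5 can be sketched in a few lines of Python. This is only an illustration of the idea, not the Ruby code used in this article: the function names, the toy profiles, the thresholds, and the 0/1 label ids are all made up, and words are matched as whole whitespace-separated tokens.

```python
from collections import Counter

def tokenize(text):
    """Step 1: split a profile into words on whitespace."""
    return text.split()

def build_vocabulary(profiles, min_count, max_count):
    """Steps 2-3: count in how many profiles each word occurs, then keep
    only words whose count lies strictly between min_count and max_count."""
    doc_freq = Counter()
    for profile in profiles:
        doc_freq.update(set(tokenize(profile)))
    return sorted(w for w, c in doc_freq.items() if min_count < c < max_count)

def vectorize(profile, vocabulary):
    """Step 4: one dimension per vocabulary word, 1 if present, else 0."""
    words = set(tokenize(profile))
    return [1 if w in words else 0 for w in vocabulary]

# Toy data, invented for illustration only.
profiles = ["coffee and code", "coffee and cats", "code and music", "cats"]
genders = ["male", "female", "male", "female"]

vocabulary = build_vocabulary(profiles, min_count=1, max_count=3)
vectors = [vectorize(p, vocabulary) for p in profiles]

# Step 5: map each gender string to an integer label (ids are arbitrary here).
label_ids = {"female": 0, "male": 1}
labels = [label_ids[g] for g in genders]

print(vocabulary)
print(vectors)
print(labels)
```

In this toy run "and" is dropped for being too common and "music" for being too rare, which is the effect step 3 is after.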

Pre-process text in Ruby

The Ruby code below performs steps 1 to 5. With this kind of approach, the final accuracy depends heavily on the text processing. There are countless ways to do it, and this code does only the bare minimum.

# https://www.kaggle.com/crowdflower/twitter-user-gender-classification
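def parse_kaggle_data
  # Read as ISO-8859-1 and transcode to UTF-8.
  str = File.read('gender-classifier-DFE-791531.csv', encoding: 'ISO-8859-1:UTF-8')
  # Naive parsing: records are split on "\r" and fields on every comma.
  lines = str.split("\r").map { |l| l.split(',') }
  header = lines[0]
  users = lines.drop(1).map { |l| header.map.with_index { |h, i| [h, l[i]] }.to_h }
  # Keep only female/male users labeled with full confidence.
  users = users.select { |u| %w(female male).include?(u['gender']) && u['gender:confidence'] == '1' }
  # to_s guards against rows that have no description field.
  [users.map { |u| u['description'].to_s }, users.map { |u| u['gender'] }]
end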

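def split_to_words(text_array)
  # Split on whitespace, double quotes, and __REP__, then strip a leading '#',
  # a trailing run of '.' or '!' (with the character before it), and a leading '(' or ')'.
  text_array.map { |d| d.split(/([\s"]|__REP__)/) }.flatten.
      map { |w| w.gsub(/^#/, '') }.
      map { |w| w.gsub(/[^.]\.+$/, '') }.
      map { |w| w.gsub(/[^!]!+$/, '') }.
      map { |w| w.gsub(/^\(/, '') }.
      map { |w| w.gsub(/^\)/, '') }.
      delete_if { |w| w.length < 2 }.
      uniq # assumed final step (the chain is cut off here): one entry per distinct word
end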

def count_words(text_array, word_array)
  # For each word, count the descriptions that contain it as a substring.
  words_count = Hash.new(0)
  text_array.each do |d|
    word_array.each do |w|
      words_count[w] += 1 if d.include?(w)
    end
  end
  words_count
end

descriptions, genders = parse_kaggle_data

desc_words = split_to_words(descriptions)
desc_words_count = count_words(descriptions, desc_words)
filtered_desc_words = desc_words.select { |w| desc_words_count[w] > 2 && desc_words_count[w] < 500 }
desc_vectors = descriptions.map { |d| filtered_desc_words.map { |w| d.include?(w) ? 1 : 0 } }
File.write('data/description_vectors.txt', desc_vectors.map { |v| v.join(' ') }.join("\n"))

labels = genders.map do |g|
  case g
  when '';        0
  when 'brand';   1
  when 'female';  2
  when 'male';    3
  when 'unknown'; 4
  end
end
File.write('data/labels.txt', labels.join("\n"))
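
One limitation of the parsing above is that splitting on every comma shifts the columns whenever a quoted description itself contains a comma. As a hedged alternative, here is a minimal Python sketch that uses the standard csv module, which honors quoted fields. The column names come from the Ruby code; the function name and the inline sample rows are made up for illustration.

```python
import csv
import io

def load_confident_users(csv_file):
    """Return (descriptions, genders) for rows labeled female/male with
    gender:confidence == '1', mirroring the filter in the Ruby code."""
    descriptions, genders = [], []
    for row in csv.DictReader(csv_file):
        if row["gender"] in ("female", "male") and row["gender:confidence"] == "1":
            # Rows without a description yield None; store an empty string.
            descriptions.append(row["description"] or "")
            genders.append(row["gender"])
    return descriptions, genders

# Made-up sample rows; the quoted description contains a comma.
sample = io.StringIO(
    'gender,gender:confidence,description\n'
    'female,1,"Writer, runner"\n'
    'brand,1,Official account\n'
    'male,0.67,Gamer\n'
    'male,1,Dad\n'
)
descriptions, genders = load_confident_users(sample)
print(descriptions)
print(genders)
```

For the real file, the same function could be called on open('gender-classifier-DFE-791531.csv', encoding='ISO-8859-1', newline=''), which is how the csv documentation recommends opening files.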

Machine learning with Python

I tried Naive Bayes, logistic regression, random forest, and a support vector machine; all gave similar results.

Method                      Accuracy
Naive Bayes (Gaussian)      0.5493
Naive Bayes (Bernoulli)     0.6367
Logistic regression         0.6151
Random forest               0.6339
Support vector machine      0.6303
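
When reading accuracies like these, it helps to compare them with the accuracy of always predicting the most frequent class. The following is a minimal sketch of that baseline; the function names and the labels are made up and are not the article's data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))

# Made-up labels (2 = female, 3 = male), for illustration only.
y_train = [2, 2, 2, 3, 3]
y_test = [2, 3, 2, 3]
print(majority_baseline(y_train, y_test))
```

A classifier is only informative to the extent that it beats this number on the same test split.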

Note that each method makes implicit assumptions about the underlying data; here we ignore them and simply compare the results.
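
As one example of such an assumption, Bernoulli naive Bayes treats every dimension as a binary feature that is independent of the others given the class, which matches 0/1 word vectors, whereas the Gaussian variant models each dimension as normally distributed. The sketch below implements Bernoulli naive Bayes from scratch to make the assumption visible. It is not scikit-learn's implementation; the function names, the smoothing value alpha=1.0, and the toy data are assumptions for illustration.

```python
import math

def fit_bernoulli_nb(vectors, labels, alpha=1.0):
    """Estimate the log prior and smoothed P(word present) for each class."""
    model = {}
    n_features = len(vectors[0])
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = math.log(len(rows) / len(vectors))
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict_bernoulli_nb(model, vector):
    """Return the class with the highest log posterior for one 0/1 vector."""
    def log_posterior(c):
        prior, probs = model[c]
        # Independence assumption: per-word log likelihoods simply add up.
        return prior + sum(
            math.log(p) if x else math.log(1.0 - p)
            for x, p in zip(vector, probs)
        )
    return max(model, key=log_posterior)

# Toy 0/1 vectors over a made-up 3-word vocabulary; 2 = female, 3 = male.
x = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [2, 2, 3, 3]
model = fit_bernoulli_nb(x, y)
print(predict_bernoulli_nb(model, [1, 0, 0]))
print(predict_bernoulli_nb(model, [0, 0, 1]))
```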

# sudo yum install -y python3
# sudo pip3 install -U pip numpy scikit-learn ipython

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import confusion_matrix
import pickle

description_vectors = np.loadtxt('data/description_vectors.txt')
labels = np.loadtxt('data/labels.txt')

(x_train, x_test, y_train, y_test) = train_test_split(description_vectors, labels)

# Fit each classifier in turn and print its accuracy on the held-out test set.
# After the loop, clf and y_pred refer to the last model (the SVM).
classifiers = [
    GaussianNB(),
    BernoulliNB(),
    LogisticRegression(),
    RandomForestClassifier(),
    SVC(C=1.0),
]
for clf in classifiers:
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(type(clf).__name__, np.mean(y_test == y_pred))

# Grid search

# best params: {'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}
parameters = [{'kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 'C': np.logspace(-2, 2, 5), 'gamma': ['scale']}]
clf = GridSearchCV(SVC(), parameters, verbose = True, n_jobs = -1)
clf.fit(x_train, y_train)

# best params: {'max_depth': 100, 'n_estimators': 300}
parameters = [{'n_estimators': [30, 50, 100, 300], 'max_depth': [25, 30, 40, 50, 100]}]
clf = GridSearchCV(RandomForestClassifier(), parameters, verbose = True, n_jobs = -1)
clf.fit(x_train, y_train)


# Evaluate the most recently fitted model on the test set.
y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Model persistence

with open('model.sav', 'wb') as f:
    pickle.dump(clf, f)
with open('model.sav', 'rb') as f:
    clf = pickle.load(f)
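
A saved model can only classify a new profile if that profile is turned into a vector with the same word list, in the same order, that was used for training. The Ruby script above writes only the vectors and labels, so the word list would have to be saved as well. The following is a minimal sketch of that idea; the file name vocabulary.json, the function names, and the sample words are hypothetical.

```python
import json
import os
import tempfile

def save_vocabulary(words, path):
    """Store the word list so the dimension order can be reproduced later."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(words, f)

def load_vocabulary(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def vectorize_profile(description, vocabulary):
    """1 if the word occurs anywhere in the description (substring match,
    like String#include? in the Ruby preprocessing), else 0."""
    return [1 if w in description else 0 for w in vocabulary]

# Made-up vocabulary; in practice this would be filtered_desc_words.
path = os.path.join(tempfile.mkdtemp(), "vocabulary.json")
save_vocabulary(["coffee", "music", "dad"], path)

vocabulary = load_vocabulary(path)
vector = vectorize_profile("Proud dad, coffee addict", vocabulary)
print(vector)
# With the unpickled model from above: clf.predict([vector])
```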

Related Links

Twitter User Gender Classification | Kaggle
Using machine learning to predict gender
