[PYTHON] Predict the presence or absence of infidelity by machine learning

Introduction

I analyzed data with Python, following Udemy's "[50,000 people in the world] Practical Python Data Science" course. The data used this time is a sample dataset bundled with the Statsmodels library. It comes from a 1974 survey that asked married women whether they had had an extramarital affair.

Affairs dataset

The goal this time: using the sample data, build a machine learning model that predicts the presence or absence of infidelity, and examine which attributes affect the result.

*** I had no particular intention in choosing this data. Self-reported answers may include falsehoods, so I do not vouch for the credibility of the data and treat it strictly as sample data. ***

Environment: Python 3, scikit-learn 0.21.2 (not the version used in the Udemy course), Jupyter Notebook + Anaconda

** Not explained **:
・ Environment setup
・ Basic syntax of Python, Pandas, NumPy, and matplotlib (anything else is explained in comments)
・ Mathematical background

** Explained **:
・ Logistic regression
・ Explanatory variables and objective variables
・ Data preparation and visualization
・ Data preprocessing
・ Model construction using scikit-learn
・ Summary

What is logistic regression?

Logistic regression is a regression analysis in which the predicted value of the objective variable (the value you want to obtain) is confined to the range between 0 and 1. Specifically, the sigmoid function squeezes the output into that range. This property makes it suitable for probability prediction and binary classification. I used logistic regression this time because the task is a binary classification: affair (1) or no affair (0).
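As a minimal sketch of the idea, the sigmoid function can be written in a few lines. The scores below are made-up values, and the 0.5 threshold is the usual convention rather than something specific to this article.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A score of 0 gives exactly 0.5; large negative scores approach 0
# and large positive scores approach 1.
scores = [-4.0, -1.0, 0.0, 1.0, 4.0]  # hypothetical linear scores
probabilities = [sigmoid(z) for z in scores]

# Binary classification: predict 1 when the probability is at least 0.5.
labels = [1 if p >= 0.5 else 0 for p in probabilities]
print(probabilities)
print(labels)
```

In logistic regression the score z is a weighted sum of the explanatory variables, and the weights are what the model learns.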

Data preparation and visualization

# Import the required libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import math

# seaborn is a popular library for drawing good-looking graphs.
# set_style changes the plot style; 'whitegrid' gives a white background with grid lines.
# If that is too much trouble, sns.set() alone already looks nice.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Import the modules needed from scikit-learn.
# The cross_validation module exists only in older versions;
# from 0.20 on, use model_selection instead.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Module used when evaluating a model
from sklearn import metrics

# Import statsmodels to use its sample data.
# Outside Anaconda, it may need to be installed separately.
import statsmodels.api as sm

Now that we're ready, let's take a look at the data overview.

#Load sample data into Pandas DataFrame
df = sm.datasets.fair.load_pandas().data

#Let's start with an overview of the data
df.info()
#output
# RangeIndex: 6366 entries, 0 to 6365
# Data columns (total 9 columns):
# rate_marriage      6366 non-null float64
# age                6366 non-null float64
# yrs_married        6366 non-null float64
# children           6366 non-null float64
# religious          6366 non-null float64
# educ               6366 non-null float64
# occupation         6366 non-null float64
# occupation_husb    6366 non-null float64
# affairs            6366 non-null float64
# dtypes: float64(9)
# memory usage: 447.7 KB

#Next, let's look at the first 5 lines
df.head()
rate_marriage  age  yrs_married  children  religious  educ  occupation  occupation_husb  affairs
3              32   9.0          3         3          17    2           5                0.1111
3              27   13.0         3         1          14    3           4                3.2308
4              22   2.5          0         1          16    3           5                1.4000
4              37   16.5         4         3          16    5           5                0.7273
5              27   9.0          1         1          14    3           4                4.6667

There are 6366 rows and 9 columns in total (the objective variable affairs plus eight explanatory variables), and there are no null values. Some notes on the column names:

・ rate_marriage: Self-evaluation of the marriage
・ educ: Educational background
・ children: Number of children
・ religious: Religiousness
・ occupation: Occupation
・ occupation_husb: Husband's occupation

You can check the details on the statsmodels website.

*** Objective variable *** refers to the variable you want to predict. Here it is "affairs", the variable that indicates the presence or absence of an affair. *** Explanatory variables *** are the variables used to predict the objective variable. Here they are all variables except affairs.

To check for the presence or absence of an affair we need a binary variable, but the objective variable affairs is a continuous real value, because the survey measures the time spent on affairs. So we add a new Had_Affair column, filled by a function that converts any non-zero value to 1.

# Return 1 (had an affair) if affairs is non-zero, otherwise 0.
def affair_check(x):
    if x != 0:
        return 1
    else:
        return 0
# apply runs the function on every value of the specified column.
df['Had_Affair'] = df['affairs'].apply(affair_check)
#Output the first 5 lines
df.head()
rate_marriage  age  yrs_married  children  religious  educ  occupation  occupation_husb  affairs  Had_Affair
3              32   9.0          3         3          17    2           5                0.1111   1
3              27   13.0         3         1          14    3           4                3.2308   1
4              22   2.5          0         1          16    3           5                1.4000   1
4              37   16.5         4         3          16    5           5                0.7273   1
5              27   9.0          1         1          14    3           4                4.6667   1

The column has been added. Now let's summarize the data to get a rough idea of which explanatory variables have an influence. Group by Had_Affair and calculate the mean of each column.

df.groupby('Had_Affair').mean()
Had_Affair  rate_marriage  age    yrs_married  children  religious  educ   occupation  occupation_husb  affairs
0           4.330          28.39  7.989        1.239     2.505      14.32  3.405       3.834            0.000
1           3.647          30.54  11.152       1.729     2.262      13.97  3.464       3.885            2.187

The second row, where Had_Affair is 1, shows a longer marriage and a lower self-evaluation of the marriage. Now let's visualize the relationship with the length of marriage as a histogram-style count plot using seaborn (think of it as a more stylish matplotlib).

# seaborn's countplot aggregates and plots the data. Arguments: the x-axis column, the target DataFrame, hue to split the bars by Had_Affair (binary), and palette for the colors.
sns.countplot(x='yrs_married', data=df.sort_values('yrs_married'), hue='Had_Affair', palette='coolwarm')

[Figure: count plot of yrs_married, split by Had_Affair]

There seems to be a relationship between the length of marriage and the presence or absence of an affair. Next, let's visualize the rate of having an affair for each length of marriage.

# barplot shows the mean on the y-axis. Since Had_Affair takes only the values 1 and 0, its mean is the proportion of 1s.
sns.barplot(data=df, x='yrs_married', y='Had_Affair')

[Figure: bar plot of the mean of Had_Affair for each value of yrs_married]

Once the marriage exceeds 9 years, the rate of having an affair exceeds 40%. Looking at the other columns in the same way would probably allow some predictions in advance, but I will stop here and move on to the next step.
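The bar heights can also be checked numerically. The sketch below uses a small made-up DataFrame (not the affairs data) to show that the mean of a 0/1 column per group is the rate of 1s in that group.

```python
import pandas as pd

# Hypothetical toy data, not taken from the affairs dataset.
toy = pd.DataFrame({
    'yrs_married': [2.5, 2.5, 2.5, 2.5, 9.0, 9.0, 9.0, 9.0],
    'Had_Affair':  [0,   0,   0,   1,   0,   1,   1,   0],
})

# The mean of a 0/1 column is the share of rows equal to 1.
rate_by_years = toy.groupby('yrs_married')['Had_Affair'].mean()
print(rate_by_years)
```

Running the same groupby on df should give the values that barplot draws, without the confidence intervals.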

Data preprocessing

Now that the visualization is complete, we will preprocess the data. Specifically, to fit the machine learning model, we separate the explanatory variables from the objective variable, put the data values into a consistent form, and deal with missing values.

First, let's put the data values into a consistent form. The "occupation" and "occupation_husb" columns only assign numbers to occupations for the sake of categorization, so the size of the number is meaningless. There is no ranking among occupations.

Therefore, create a new column for each occupation category. Each column holds one of two values: 1 if the record belongs to that category, 0 otherwise. It would be tedious by hand, but it is a snap with Pandas' dummy variable function.
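As a small illustration of what the dummy variable function produces, here is a sketch on a made-up column. The values and the `occ` prefix are assumptions for the example only.

```python
import pandas as pd

# Hypothetical toy column with three occupation codes.
toy = pd.DataFrame({'occupation': [2.0, 3.0, 2.0, 5.0]})

# One column per distinct code; each row gets 1 in the column of its own code.
dummies = pd.get_dummies(toy['occupation'], prefix='occ', dtype=int)
print(dummies)
```

Every row contains exactly one 1, so the original code can always be recovered from the dummy columns.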

Then delete the original occupation columns, which are no longer needed, assign the objective variable to Y and the explanatory variables to X, and delete affairs, which is the source of the objective variable.

The output is a single table, but it has too many columns to read comfortably on Qiita, so it is shown in two parts.

# Use pandas' function for creating dummy variables. scikit-learn apparently has an equivalent.
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])

# Name the category columns. Using the category names from the original data would be easier to read, but I gave up because it was too much trouble.
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']

# Drop the occupation columns, which are no longer needed, and the objective variable Had_Affair. Also drop affairs.
# For axis, 0 means rows and 1 means columns.
# Unless inplace=True is passed, drop does not modify the original DataFrame.
X = df.drop(['occupation','occupation_husb','Had_Affair','affairs'],axis=1)

#Combine the dummy variables into the DataFrame of the explanatory variable X.
dummies = pd.concat([occ_dummies,hus_occ_dummies],axis=1)
X = pd.concat([X,dummies],axis=1)

#Assign the objective variable to Y
Y = df.Had_Affair

#output
X.head()
rate_marriage  age  yrs_married  children  religious  educ
3              32   9.0          3         3          17
3              27   13.0         3         1          14
4              22   2.5          0         1          16
4              37   16.5         4         3          16
5              27   9.0          1         1          14

occ1  occ2  occ3  occ4  occ5  occ6  hocc1  hocc2  hocc3  hocc4  hocc5  hocc6
0     1     0     0     0     0     0      0      0      0      1      0
0     0     1     0     0     0     0      0      0      1      0      0
0     0     1     0     0     0     0      0      0      0      1      0
0     0     0     0     1     0     0      0      0      0      1      0
0     0     1     0     0     0     0      0      0      1      0      0

Apparently the analysis can break down when there is a strong correlation between independent variables. This is called ** multicollinearity **. I could not understand the details even after googling, so I will look into it when I study statistics next month or so.

For the time being, the highly correlated columns in this data are the occupation dummy variables, so it seems this can be handled by dropping one column from each set.

# This is enough to deal with it for the time being
X = X.drop('occ1',axis=1)
X = X.drop('hocc1',axis=1)
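A rough way to see why one dummy column per set is redundant: a full set of dummy columns always sums to 1 in every row, so together with a constant (intercept) column the columns are linearly dependent. The sketch below checks this on made-up data and also shows `drop_first=True`, an existing get_dummies option that drops the first category in one step. The toy values are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'occupation': [1.0, 2.0, 3.0, 2.0, 1.0]})  # hypothetical values

full = pd.get_dummies(toy['occupation'], dtype=int)
# Every row sums to 1, so each column equals 1 minus the sum of the others.
redundant = (full.sum(axis=1) == 1).all()

# With an intercept column added, 4 columns carry only 3 independent directions.
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank = np.linalg.matrix_rank(design)

# drop_first=True drops the first category, which becomes the reference level.
reduced = pd.get_dummies(toy['occupation'], drop_first=True, dtype=int)
print(redundant, rank, list(reduced.columns))
```

The dropped category does not disappear from the model. It becomes the baseline that the remaining dummy coefficients are measured against.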

The objective variable Y is a Series, so convert it to a one-dimensional array so that it can be fitted to the model. This completes the data preprocessing.


type(Y)
Y = np.ravel(Y)

Model construction using scikit-learn

Build a logistic regression model using scikit-learn.

#Create an instance of the LogisticRegression class.
log_model = LogisticRegression() 
#Create a model using the data.
log_model.fit(X,Y)
#Let's check the accuracy of the model.
log_model.score(X,Y)
#output
#0.7260446120012567

The accuracy of this model is about 73%. Given that the model was simply trained with the default parameters, is this about what one should expect? Now let's display the regression coefficients and explore which variables contribute to the prediction.
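Note that score(X, Y) above is measured on the same rows the model was trained on. A common way to get a fairer estimate is to hold out part of the data with train_test_split and compare against a simple baseline. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, so that it runs on its own); with the article's data, X and Y would take the place of X_demo and Y_demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic stand-in data (an assumption); in the article this would be X and Y.
X_demo, Y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

# Hold out 25% of the rows so the score is measured on unseen data.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

test_accuracy = metrics.accuracy_score(Y_test, model.predict(X_test))

# Baseline: accuracy of always predicting the more common class.
baseline = max(Y_test.mean(), 1 - Y_test.mean())
print(test_accuracy, baseline)
```

If the test accuracy is not clearly above the baseline, the model is adding little beyond always guessing the majority class.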

# Create a DataFrame that stores the variable names and their coefficients.
# coef_ holds the regression coefficients.
coeff_df = DataFrame([X.columns, log_model.coef_[0]]).T
coeff_df
0              1
rate_marriage  -0.72992
age            -0.05343
yrs_married    0.10210
children       0.01495
religious      -0.37498
educ           0.02590
occ2           0.27846
occ3           0.58384
occ4           0.35833
occ5           0.99972
occ6           0.31673
hocc2          0.48310
hocc3          0.65189
hocc4          0.42345
hocc5          0.44224
hocc6          0.39460

The table shows the regression coefficient that the fitted model assigned to each explanatory variable. If a coefficient is positive, the larger the value of that variable, the greater the likelihood of infidelity. If it is negative, the opposite is true. From this table, the likelihood of infidelity appears to decrease as the self-evaluation of the marriage and the degree of religiousness increase, and to increase with the number of years married. Coefficients are also shown for each occupation, but because category 1 was dropped as a measure against multicollinearity, they are best read only for reference, relative to that dropped category. (Incidentally, occ5 has a fairly high value. That occupation category is managerial, so the value may be intuitively convincing.)
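One common way to read a logistic regression coefficient is to exponentiate it, which gives the factor by which the odds change for a one-unit increase in the variable when the other variables are held fixed. The sketch below applies this to two coefficients copied from the table. It is only an illustration, since the variables here are on different scales and the data is treated as sample data.

```python
import math

# Coefficients copied from the table above.
coef_rate_marriage = -0.72992
coef_yrs_married = 0.10210

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase, other variables held fixed.
odds_ratio_rate = math.exp(coef_rate_marriage)
odds_ratio_years = math.exp(coef_yrs_married)
print(round(odds_ratio_rate, 3), round(odds_ratio_years, 3))
```

Under this reading, each extra point of rate_marriage roughly halves the odds, while each extra year of marriage raises them by about 11%.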

Summary

To improve the accuracy, you could normalize the data and tune the parameters by trial and error. However, given the doubtful credibility of the data, I felt I would learn more by looking at the regression coefficients of the model and analyzing how each attribute relates to the result.
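For reference, normalization and parameter search can be combined in scikit-learn with a pipeline and GridSearchCV. This is a sketch on synthetic stand-in data (an assumption), and the grid of C values is arbitrary. It is not something the article actually ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); in the article this would be X and Y.
X_demo, Y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Standardize the features, then fit logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the regularization strength is addressed as logisticregression__C.
param_grid = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, Y_demo)

best_C = search.best_params_['logisticregression__C']
print(best_C, search.best_score_)
```

Putting the scaler inside the pipeline means it is refitted on the training part of each cross-validation fold, so the validation rows do not leak into the scaling.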

Digression

Posting the table output by DataFrame to Qiita was very difficult and took about 5 hours.

At first, I tried to convert the DataFrame to a matplotlib table and output it as an image, but I gave up because the index text became tiny and I did not know how to fix it. Next, I tried pytablewriter, a library that converts a DataFrame to Markdown. Since Anaconda does not distribute it, I had no choice but to install it with pip. The imported library then raised a "cannot import name" error, so I looked into it.

If you use pip inside an Anaconda environment, libraries may conflict and cause trouble.

What a surprise! Come to think of it, Anaconda and pip can install different versions of the same dependency, so problems are likely. I had never paid attention to this, and when I ran conda list there were countless entries marked pypi, so I pretended I had not seen them. I wondered whether the same problem occurs in other ecosystems, such as npm and yarn, and asked an engineer friend, who kindly answered "The libraries are saved in the same place!", so the truth remains in the dark. The countermeasure is to create either another Anaconda environment or a separate environment that uses only pip. I chose the latter and installed the libraries with pip from scratch. Installing Statsmodels with pip produced an error in some cases (it is easier with Anaconda), and there were a few hiccups, but in the end it was solved safely.

*** I respect the posters who create tables in Markdown so quickly. If there is a good way to do it, I would like to know. ***
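One possible approach that needs no extra library is to build the Markdown table by hand from the DataFrame. The helper below, `to_markdown_table`, is a hypothetical name introduced for this sketch, and it ignores the index and does no escaping of special characters. Newer pandas versions also provide DataFrame.to_markdown, which depends on the tabulate package.

```python
import pandas as pd

def to_markdown_table(frame: pd.DataFrame) -> str:
    """Render a DataFrame (without its index) as a Markdown pipe table."""
    header = '| ' + ' | '.join(str(c) for c in frame.columns) + ' |'
    divider = '| ' + ' | '.join('---' for _ in frame.columns) + ' |'
    rows = ['| ' + ' | '.join(str(v) for v in row) + ' |'
            for row in frame.itertuples(index=False)]
    return '\n'.join([header, divider] + rows)

toy = pd.DataFrame({'age': [32, 27], 'children': [3, 3]})  # hypothetical values
print(to_markdown_table(toy))
```

The printed text can be pasted directly into a Markdown editor.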
