[PYTHON] Predict the presence or absence of infidelity by machine learning

Introduction

I analyzed some data with Python, following Udemy's [50,000 people in the world] Practical Python Data Science course. The data used this time is a sample dataset bundled with the Statsmodels library. It comes from a survey conducted in 1974 that asked married women whether they had had an affair.

Affairs dataset

The goal this time: using the sample data, build a machine learning model that predicts the presence or absence of infidelity, and examine which attributes influence the result.

***There is no particular intention behind choosing this data. Self-reported answers may include falsehoods, so I make no claims about the credibility of the data and treat it strictly as sample data.***

Environment: Python 3, scikit-learn version 0.21.2 (this differs from the version used in the Udemy course), Jupyter Notebook + Anaconda

**What I won't explain**:
・ Environment setup
・ Basic grammar for Python, Pandas, NumPy, matplotlib (anything else is explained in comments)
・ Mathematical background

**What I will explain**:
・ Logistic regression
・ Explanatory variables and objective variables
・ Data preparation and visualization
・ Data preprocessing
・ Model construction using scikit-learn
・ Summary

What is logistic regression?

Logistic regression is a form of regression analysis in which the predicted value of the objective variable (the quantity you want to predict) always falls between 0 and 1. Specifically, a weighted sum of the explanatory variables is passed through the sigmoid function, which squeezes any real number into that range. Because the output can be read as a probability, it is used for probability prediction and binary classification. I used logistic regression this time because the task is a binary classification: affair (1) or no affair (0).
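To make the role of the sigmoid function a little more concrete, here is a minimal sketch in plain Python. The weights, bias and feature values are hypothetical numbers chosen only for illustration; they are not taken from the model built later in this article.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability of class 1 for one sample: sigmoid of a weighted sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(weights, bias, features, threshold=0.5):
    """Binary classification: 1 if the probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights, bias and sample (illustration only)
p = predict_proba([0.8, -1.2], 0.1, [2.0, 1.0])
label = predict_label([0.8, -1.2], 0.1, [2.0, 1.0])
print(p, label)
```

The weighted sum can be any real number, but the value returned by sigmoid always stays between 0 and 1, which is why it can be thresholded into the two classes.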

Data preparation and visualization

#Required library import
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import math

#seaborn is a library for drawing good-looking graphs. It seems to be popular.
#set_style changes the plot style. Here we choose 'whitegrid': a white background with grid lines.
#If that is too much trouble, just calling sns.set() already gives a nicer look.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

#Import the scikit-learn modules we need
#cross_validation is only available in older versions
#From version 0.20 on, use model_selection instead
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Module used when evaluating a model
from sklearn import metrics

#Import statsmodels to use its sample data
#Outside Anaconda, statsmodels may need to be installed separately
import statsmodels.api as sm

Now that we're ready, let's take a look at the data overview.

#Load sample data into Pandas DataFrame
df = sm.datasets.fair.load_pandas().data

#Let's start with an overview of the data
df.info()
#output
# RangeIndex: 6366 entries, 0 to 6365
# Data columns (total 9 columns):
# rate_marriage      6366 non-null float64
# age                6366 non-null float64
# yrs_married        6366 non-null float64
# children           6366 non-null float64
# religious          6366 non-null float64
# educ               6366 non-null float64
# occupation         6366 non-null float64
# occupation_husb    6366 non-null float64
# affairs            6366 non-null float64
# dtypes: float64(9)
# memory usage: 447.7 KB

#Next, let's look at the first 5 rows
df.head()

rate_marriage	age	yrs_married	children	religious	educ	occupation	occupation_husb	affairs
3	32	9.0	3	3	17	2	5	0.1111
3	27	13.0	3	1	14	3	4	3.2308
4	22	2.5	0	1	16	3	5	1.4000
4	37	16.5	4	3	16	5	5	0.7273
5	27	9.0	1	1	14	3	4	4.6667

There are 6366 rows and 9 columns (the objective variable affairs plus the explanatory variables), and you can see that there are no null values. Some notes on the column names:

・ rate_marriage: Self-evaluation of the marriage
・ educ: Educational background
・ children: Number of children
・ religious: Religiousness
・ occupation: Occupation
・ occupation_husb: Husband's occupation

You can check the details on the statsmodels website.

***Objective variable*** refers to the variable you want to predict. In this case that is "affairs", the variable for the presence or absence of an affair. ***Explanatory variables*** are the variables used to predict the objective variable. This time, that means every variable except affairs.

To check for the presence of an affair we need a two-valued target, but the objective variable affairs is a continuous real value. This is because the survey question measured the amount of time spent in affairs. So we'll add a new Had_Affair column, filled in by a function that converts any non-zero number to 1.

#Return 1 (had an affair) if affairs is non-zero, otherwise 0.
def affair_check(x):
    if x != 0:
        return 1
    else:
        return 0
#apply runs the function on each value of the specified column.
df['Had_Affair'] = df['affairs'].apply(affair_check)
#Output the first 5 rows
df.head()

rate_marriage	age	yrs_married	children	religious	educ	occupation	occupation_husb	affairs	Had_Affair
3	32	9.0	3	3	17	2	5	0.1111	1
3	27	13.0	3	1	14	3	4	3.2308	1
4	22	2.5	0	1	16	3	5	1.4000	1
4	37	16.5	4	3	16	5	5	0.7273	1
5	27	9.0	1	1	14	3	4	4.6667	1
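As a side note, the same 0/1 column can probably be produced without a Python-level function by comparing the whole column at once. The sketch below uses a small made-up DataFrame (`toy`, with invented values, not the real survey data) to check that both approaches give the same result.

```python
import pandas as pd

# Toy stand-in for the 'affairs' column (values invented for illustration)
toy = pd.DataFrame({'affairs': [0.0, 0.1111, 3.2308, 0.0, 1.4]})

# Element-wise function, same logic as affair_check in this article
def affair_check(x):
    return 1 if x != 0 else 0

toy['via_apply'] = toy['affairs'].apply(affair_check)

# Vectorized version: compare the whole column, then cast True/False to 1/0
toy['via_compare'] = (toy['affairs'] != 0).astype(int)

print(toy)
```

On the real data this would correspond to something like `(df['affairs'] != 0).astype(int)`. The apply version is easier to read step by step, while the comparison avoids calling a Python function once per row.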

The column has been added. Now let's get a quick feel for which explanatory variables might be influencing the result. Group by Had_Affair and calculate the mean of each column.

df.groupby('Had_Affair').mean()

Had_Affair	rate_marriage	age	yrs_married	children	religious	educ	occupation	occupation_husb	affairs
0	4.330	28.39	7.989	1.239	2.505	14.32	3.405	3.834	0.000
1	3.647	30.54	11.152	1.729	2.262	13.97	3.464	3.885	2.187

In the second row, where Had_Affair is 1, you can see that the marriages are longer and the self-evaluation of the marriage is lower. Now let's visualize the relationship with the length of marriage as a histogram-style count plot using seaborn (think of it as a more stylish matplotlib).

#seaborn's countplot aggregates the data and draws the counts. Arguments: the x-axis column, the source DataFrame, hue='Had_Affair' to split each bar into the two classes, and the color palette.
sns.countplot(x='yrs_married',data=df.sort_values('yrs_married'),hue='Had_Affair',palette='coolwarm')

[Figure: count plot of yrs_married, with bars split by Had_Affair]

There seems to be a relationship between the length of marriage and the presence or absence of an affair. Next, let's visualize the rate of having an affair for each length of marriage.

#barplot draws the mean of y for each x. Since Had_Affair is either 1 or 0, its mean is the proportion of 1s.
sns.barplot(data=df, x='yrs_married', y='Had_Affair')

[Figure: bar plot of the mean of Had_Affair for each value of yrs_married]

Once the marriage exceeds 9 years, the rate of having an affair exceeds 40%. Looking at the other variables in the same way would probably allow some predictions in advance, but I'll stop here and move on to the next step.
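The bar heights in a plot like this are just group means, so the same rates can be read off as numbers with a groupby. Here is a minimal sketch on a made-up DataFrame (`toy`, with invented values, not the survey data).

```python
import pandas as pd

# Invented toy data: 4 people married 2.5 years, 5 people married 9 years
toy = pd.DataFrame({
    'yrs_married': [2.5, 2.5, 2.5, 2.5, 9.0, 9.0, 9.0, 9.0, 9.0],
    'Had_Affair':  [0,   0,   0,   1,   1,   0,   1,   0,   0],
})

# Mean of a 0/1 column within each group = share of 1s in that group
rate = toy.groupby('yrs_married')['Had_Affair'].mean()
print(rate)
```

On the real data the equivalent call would be `df.groupby('yrs_married')['Had_Affair'].mean()`, which should make it easier to check exactly where the rate crosses a given level.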

Data preprocessing

Now that the visualization is complete, we will preprocess the data. Specifically, in order to fit the machine learning model, we separate the explanatory variables from the objective variable, put the data values into a consistent form, and deal with missing values.

Let's start with the data values. The "occupation" and "occupation_husb" columns hold numbers that were assigned only for convenience to label the categories, so the magnitude of the numbers is meaningless. No occupation ranks above another.

Therefore, we create a new column for each occupation category. Each record gets 1 in the column that applies to it and 0 otherwise, so the data is expressed in two values. It sounds like a tedious task, but it's a snap with pandas' dummy variable function.
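To show what the dummy variables look like before applying them to the real columns, here is a small sketch on invented data. The `toy` DataFrame and the `occ` prefix are my own choices for illustration; `dtype=int` is passed so the result is shown as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Invented occupation codes for four people
toy = pd.DataFrame({'occupation': [2.0, 3.0, 3.0, 5.0]})

# One new column per category; 1 where the row belongs to that category
dummies = pd.get_dummies(toy['occupation'].astype(int), prefix='occ', dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column of its own category. Only the categories that actually appear in the data get a column, which is why this toy example produces three columns.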

Then we delete the original occupation columns, which are no longer needed, assign the objective variable to Y and the explanatory variables to X, and also delete affairs, which is the source data of the objective variable.

The output is really a single table, but it has too many columns to read comfortably on Qiita, so it is shown in two parts.

#Use the pandas function that creates dummy variables. scikit-learn seems to have an equivalent.
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])

#Name the categories. The original category names would actually be easier to read, but I skipped that because it was a hassle.
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']

#Drop the occupation columns that are no longer needed and the objective variable 'Had_Affair'. Also drop affairs.
#For axis, 0 specifies rows and 1 specifies columns.
#Unless you pass inplace=True, drop does not modify the original DataFrame; it returns a new one.
X = df.drop(['occupation','occupation_husb','Had_Affair','affairs'],axis=1)

#Combine the dummy variables into the DataFrame of the explanatory variable X.
dummies = pd.concat([occ_dummies,hus_occ_dummies],axis=1)
X = pd.concat([X,dummies],axis=1)

#Assign the objective variable to Y
Y = df.Had_Affair

#output
X.head()

rate_marriage	age	yrs_married	children	religious	educ
3	32	9.0	3	3	17
3	27	13.0	3	1	14
4	22	2.5	0	1	16
4	37	16.5	4	3	16
5	27	9.0	1	1	14

occ2	occ3	occ5	hocc4	hocc5
1	0	0	0	1
0	1	0	1	0
0	1	0	0	1
0	0	1	0	1
0	1	0	1	0

It seems that the analysis can break down if there is a strong correlation between the explanatory variables. This is called **multicollinearity**, but I couldn't understand the details even after googling, so I'll look into it when I study statistics around next month.

For the time being, the highly correlated columns in this data are the occupation columns built from dummy variables, so it seems this can be dealt with by deleting one column from each group.

#Dropping one dummy column per group handles this for the time being
X = X.drop('occ1',axis=1)
X = X.drop('hocc1',axis=1)
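As far as I understand it, the reason one column per group is dropped is that a full set of dummy columns always adds up to 1 in every row, which makes it perfectly predictable from a constant (intercept) column. The sketch below checks this on invented data with NumPy's matrix rank; the `toy` Series and the explicit intercept column are assumptions for illustration, not part of the model code in this article.

```python
import numpy as np
import pandas as pd

# Invented category codes
toy = pd.Series([1, 2, 3, 1, 2, 3, 2], name='occ')

# Full dummy set (3 columns) vs. one category dropped (2 columns)
full = pd.get_dummies(toy, prefix='occ', dtype=float)
reduced = pd.get_dummies(toy, prefix='occ', dtype=float, drop_first=True)

# Add a constant column, as a model with an intercept effectively does
intercept = np.ones((len(toy), 1))
design_full = np.hstack([intercept, full.to_numpy()])
design_reduced = np.hstack([intercept, reduced.to_numpy()])

# Rank below the column count means some column is redundant
rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)
print(design_full.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

With the full set, the matrix has 4 columns but only rank 3, so one column carries no new information. After dropping one dummy, the column count and the rank match. Note that scikit-learn's LogisticRegression applies L2 regularization by default, so it can still fit with redundant columns; dropping one mainly keeps the coefficients interpretable.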

The objective variable Y is a Series, so convert it to a one-dimensional array to fit the model. This completes the data preprocessing.


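#Check the current type of Y (a pandas Series)
type(Y)
#np.ravel flattens Y into a one-dimensional NumPy array
Y = np.ravel(Y)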

Model construction using scikit-learn

Build a logistic regression model using scikit-learn.

#Create an instance of the LogisticRegression class.
log_model = LogisticRegression()
#Fit the model to the data.
log_model.fit(X,Y)
#Check the accuracy of the model (measured on the same data it was trained on).
log_model.score(X,Y)
#output
#0.7260446120012567

The accuracy of this model is about 73%. Given that it is measured on the training data itself and the parameters are left at their defaults, I suppose this is about what to expect. Now let's display the regression coefficients and explore which variables contribute to the prediction.

#Create a DataFrame that holds each variable name and its coefficient.
#coef_ holds the regression coefficients.
coeff_df = DataFrame([X.columns, log_model.coef_[0]]).T
coeff_df

0	1
rate_marriage	-0.72992
age	-0.05343
yrs_married	0.10210
children	0.01495
religious	-0.37498
educ	0.02590
occ2	0.27846
occ3	0.58384
occ4	0.35833
occ5	0.99972
occ6	0.31673
hocc2	0.48310
hocc3	0.65189
hocc4	0.42345
hocc5	0.44224
hocc6	0.39460

This shows the regression coefficient that the fitted model assigned to each explanatory variable. If a coefficient is positive, a higher value of that variable goes with a higher likelihood of infidelity. If it is negative, the opposite is true. From this table, the likelihood of infidelity appears to decrease as the self-evaluation of the marriage and religiousness increase, and to increase as the number of years of marriage increases. Coefficients are also shown per occupation, but because category 1 was dropped as a measure against multicollinearity, they are relative to that dropped category and are best treated as a rough reference. (By the way, occ5 has a fairly high value. Since that occupation is managerial, it may be a value that is intuitively convincing.)
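One common way to read logistic regression coefficients is to exponentiate them: exp(coefficient) is the factor by which the odds are multiplied when that variable goes up by one unit, with the other variables held fixed. The sketch below does this for three coefficients copied from the table above (rounded as displayed). Keep in mind that these come from a model fitted with default settings, so treat the numbers as a rough guide.

```python
import math

# Coefficients copied from the table above (rounded as displayed)
coefs = {
    'rate_marriage': -0.72992,
    'yrs_married': 0.10210,
    'religious': -0.37498,
}

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per one-unit increase")
```

A ratio below 1 means the odds shrink as the variable increases, and a ratio above 1 means they grow, which matches the sign-based reading of the table.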

Summary

If you want to improve the accuracy, you can try normalization and tune the parameters by trial and error. However, given the questionable credibility of the data, I felt there was more to learn from looking at the regression coefficients of the model and analyzing how each attribute relates to the result.
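The imports at the top include train_test_split and metrics, which this walkthrough never ends up using. As a rough sketch of how they could fit in, the code below holds out part of the data for testing, normalizes the features with StandardScaler, and compares the accuracy with a "always predict the majority class" baseline. It runs on synthetic data from make_classification, not on the affairs data; the names `X_demo` and `y_demo` and all parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Synthetic stand-in data (NOT the affairs dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.68, 0.32], random_state=0)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# Normalization + logistic regression, fitted on the training part only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Accuracy on data the model has not seen
y_pred = model.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, y_pred)

# Baseline: accuracy of always predicting the more frequent class
baseline_accuracy = max(y_test.mean(), 1 - y_test.mean())

print(test_accuracy, baseline_accuracy)
```

With the article's data, X and Y would take the place of `X_demo` and `y_demo`. Comparing the held-out accuracy with the baseline gives a better sense of how much the model has actually learned than the training accuracy alone.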

Digression

Posting the tables output by DataFrame to Qiita was very hard and took about 5 hours.

At first, I tried converting them to matplotlib tables and exporting them as images, but I gave up because the index text came out tiny and I couldn't figure out how to fix it. Next, I tried a library called pytablewriter that converts a DataFrame to Markdown, but since Anaconda does not distribute it, I had no choice but to install it with pip. Importing it then raised a "cannot import name" error, so I looked into it.

If you install packages with pip in an Anaconda environment, the libraries may conflict and cause trouble.

What a surprise! Thinking about it, Anaconda and pip can end up with different versions of the dependent libraries, so problems are to be expected. I had never paid attention to this, and when I checked with conda list there were countless pypi entries, so I pretended not to see them. I wondered whether other ecosystems, such as npm and yarn, have the same problem, but when I asked an engineer friend I only got the helpful answer "The libraries are saved in the same place!", so the truth remains in the dark.

The countermeasure is either to create another Anaconda environment or to create a separate environment that uses only pip. I chose the latter and installed the libraries with pip from scratch. Installing Statsmodels with pip sometimes fails (it's easier with Anaconda), and a few other things went wrong, but in the end it was resolved safely.

***I respect the posters who turn out Markdown tables so quickly. If there is a good way to do it, I would like to know.***
