I tried to understand how to use Pandas and multicollinearity based on the Affairs dataset.

Introduction

Last time, I summarized what I learned about the theory of logistic regression.

I then deepened my understanding by implementing a binary classifier with my own logistic regression: https://qiita.com/Fumio-eisan/items/e2c625c4d28d74cf02f3

This time, I built a model using an actual dataset. I summarize the basics of so-called data preprocessing (creating dummy variables, deleting and concatenating columns), data interpretation, and multicollinearity, which is a common problem in multivariate analysis. The content is mostly hands-on implementation.


The dataset

This time, I used a dataset from a 1974 survey of married women on whether or not they had had extramarital affairs.

affair.ipynb


import statsmodels.api as sm

# Load the Fair (affairs) survey data as a pandas DataFrame
df = sm.datasets.fair.load_pandas().data
df.head()


Looking at the data, you can see explanatory variables such as years since marriage, age, and whether there are children. The last column, affairs, is numeric: 0 means the respondent has not had an affair, and 1 or more means she has (or had).
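
As a quick sanity check (my own addition, not in the original post), you can count how many respondents fall on each side before binarizing:


# Count respondents with no affair (affairs == 0) vs. one or more
print((df['affairs'] == 0).sum(), 'with no affair')
print((df['affairs'] > 0).sum(), 'with one or more')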

Interpreting the data by displaying several plots in a single figure

Let us evaluate the difference between respondents with and without affairs. First, since the affairs column takes various values, split it into affair (1 or more) and no affair (0).

affair.ipynb


def affair_check(x):
    # Collapse the affairs count into a binary flag: 1 if any affair, else 0
    if x != 0:
        return 1
    else:
        return 0

df['Had_Affair'] = df['affairs'].apply(affair_check)

Now interpret the data to look for variables that are likely to matter for the prediction model. To do this, split each variable by Had_Affair (1 = affair, 0 = none) and draw a count plot per variable. plt.subplots returns the axes, and each one is passed as the ax argument of the plot it should hold.

affair.ipynb


import matplotlib.pyplot as plt
import seaborn as sns

# One count plot per variable on a 3x3 grid, colored by Had_Affair
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(10, 8))

sns.countplot(x='age', hue='Had_Affair', data=df, ax=axes[0, 0])
sns.countplot(x='yrs_married', hue='Had_Affair', data=df, ax=axes[0, 1])
sns.countplot(x='children', hue='Had_Affair', data=df, ax=axes[0, 2])
sns.countplot(x='rate_marriage', hue='Had_Affair', data=df, ax=axes[1, 0])
sns.countplot(x='religious', hue='Had_Affair', data=df, ax=axes[1, 1])
sns.countplot(x='educ', hue='Had_Affair', data=df, ax=axes[1, 2])
sns.countplot(x='occupation', hue='Had_Affair', data=df, ax=axes[2, 0])
sns.countplot(x='occupation_husb', hue='Had_Affair', data=df, ax=axes[2, 1])
plt.tight_layout()


Now that everything is visible at once, we can interpret the data. Basically, you should focus on the variables whose peaks differ between the **affair group and the no-affair group**.
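
A rough numeric counterpart to eyeballing the histograms (again my own addition, using standard pandas calls) is to compare per-group means:


# Mean of every variable within each group; large gaps suggest useful predictors
print(df.groupby('Had_Affair').mean())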

Data preprocessing

Introducing dummy variables

Now we preprocess the data to build the prediction model. In this Affairs dataset, the categorical variables are occupation and occupation_husb (the husband's occupation). For these we introduce dummy variables, encoding each category as 0/1.

The implementation is as follows.

affair.ipynb


import pandas as pd

# One-hot encode the two categorical columns (six occupation categories each)
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])
occ_dummies.columns = ['occ1', 'occ2', 'occ3', 'occ4', 'occ5', 'occ6']
hus_occ_dummies.columns = ['hocc1', 'hocc2', 'hocc3', 'hocc4', 'hocc5', 'hocc6']
occ_dummies


The columns were split as expected.

Deleting and concatenating columns

Next, remove the columns we no longer need and concatenate the ones we do. Keep Had_Affair as the target Y, then delete the occupation, occupation_husb, and Had_Affair columns. I also drop the raw affairs count here, since the target was derived directly from it and would otherwise leak the answer into the features.

affair.ipynb


Y = df['Had_Affair']  # keep the binary target before dropping it below
X = df.drop(['occupation', 'occupation_husb', 'Had_Affair', 'affairs'], axis=1)

Next, concatenate the two sets of dummy variables.

affair.ipynb


# Put the wife's and husband's occupation dummies side by side
dummies = pd.concat([occ_dummies, hus_occ_dummies], axis=1)

Finally, combine the dummy variables with the remaining data.

affair.ipynb


XX = pd.concat([X, dummies], axis=1)

About multicollinearity

Next, let us consider multicollinearity. This problem tends to appear as the number of explanatory variables grows: when some explanatory variables are strongly correlated with one another, the phenomenon is called **multicollinearity**. When it is severe, the accuracy of the regression equation can become extremely poor and the analysis results unstable.

For example, in a model that predicts house prices, the "number of rooms" and the "floor area" can be expected to correlate strongly. In such cases you can avoid multicollinearity by excluding one of the variables.
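
As a side note not in the original post, one way to check for this numerically is the variance inflation factor (VIF) from statsmodels. A minimal sketch, run on XX while it still contains all twelve dummy columns, where the perfect collinearity should show up as huge (or infinite) VIFs:


import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF per feature: 1/(1 - R^2) of regressing that feature on all the others.
# Values far above ~10 (or inf, for exact collinearity) are a warning sign.
X_vif = add_constant(XX)  # VIF calculations expect an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)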

This time, I build the model after excluding occ1 and hocc1 (= student) from the occupation dummies. Because the six dummies in each group always sum to 1, keeping all of them makes any one column an exact linear combination of the others, which is precisely the multicollinearity we want to avoid.

affair.ipynb


# Drop one dummy level from each group to break the exact linear dependence
XX = XX.drop(['occ1', 'hocc1'], axis=1)


Predicting with logistic regression

Now let us fit and evaluate the model. This time I make a simple prediction with scikit-learn's logistic regression. The model is first trained only on the training data, and then used to predict on the test data.

affair.ipynb


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Hold out a test set, train on the rest, then score the predictions
X_train, X_test, Y_train, Y_test = train_test_split(XX, Y)
model2 = LogisticRegression()
model2.fit(X_train, Y_train)
class_predict = model2.predict(X_test)
print(metrics.accuracy_score(Y_test, class_predict))

0.707286432160804

The accuracy came out to about 70%. Now, what happens if the columns that were deleted earlier to avoid multicollinearity are left in place (in other words, the data is used as it is)?

affair.ipynb


# Assumption: X2 is the untreated feature matrix, i.e. the dummies
# concatenated without dropping occ1 and hocc1
X2 = pd.concat([X, dummies], axis=1)

X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y)
model3 = LogisticRegression()
model3.fit(X2_train, Y2_train)
class_predict2 = model3.predict(X2_test)
print(metrics.accuracy_score(Y2_test, class_predict2))

0.9748743718592965

The accuracy was as high as 97%. **In this case, it turned out better to leave the data as it was, since it did not cause a multicollinearity problem.**

In other words, whether multicollinearity needs to be dealt with seems to be something you have to check empirically: run the calculation once with all the data included and once with the correlated columns removed, and compare.
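
If you want that comparison to depend less on a single random train/test split, one option (not in the original notebook) is to average accuracy over several folds with scikit-learn's cross_val_score:


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 folds for each feature set
for name, features in [('occ1/hocc1 dropped', XX), ('all columns kept', X2)]:
    scores = cross_val_score(LogisticRegression(max_iter=1000), features, Y, cv=5)
    print(name, scores.mean())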

In closing

I interpreted the data with pandas and matplotlib and performed preprocessing with multicollinearity in mind. Because this is a tutorial-style dataset, everything went smoothly, but it was still good practice with pandas, including drawing graphs and combining data. Also, since running logistic regression itself is very simple, it was convenient to be able to get results without knowing what happens inside.

The full program is here. https://github.com/Fumio-eisan/affairs_20200412
