I tried to understand how to use Pandas and multicollinearity based on the Affairs dataset.

Introduction

Last time, I summarized what I learned about the theory of logistic regression.

I then deepened my understanding by implementing a binary classifier with my own logistic regression: https://qiita.com/Fumio-eisan/items/e2c625c4d28d74cf02f3

This time, I built a model using an actual dataset. I summarize the basics of so-called data preprocessing (creating dummy variables, deleting and concatenating columns), data interpretation, and multicollinearity, which is a common problem in multivariate analysis. The content is mostly hands-on implementation.


The dataset

This time, I used a dataset from a 1974 survey of married women on whether or not they had had extramarital affairs.

affair.ipynb


import statsmodels.api as sm

# Load the Fair (affairs) survey data as a pandas DataFrame
df = sm.datasets.fair.load_pandas().data
df.head()


Looking at the data, you can see explanatory variables such as years since marriage, age, and whether there are children. The last column, affairs, is numeric: 0 means the respondent has not had an affair, and 1 or more means she has (or had).
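
As a quick sanity check (my own addition, not in the original post), you can count how many respondents fall on each side before binarizing:


# Count respondents with no affair (affairs == 0) vs. one or more
print((df['affairs'] == 0).sum(), 'with no affair')
print((df['affairs'] > 0).sum(), 'with one or more')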

Interpreting the data by displaying several plots in a single figure

Let us evaluate the difference between respondents with and without affairs. First, since the affairs column takes various values, split it into affair (1 or more) and no affair (0).

affair.ipynb


def affair_check(x):
    # Collapse the affairs count into a binary flag: 1 if any affair, else 0
    if x != 0:
        return 1
    else:
        return 0

df['Had_Affair'] = df['affairs'].apply(affair_check)

Now interpret the data to look for variables that are likely to matter for the prediction model. To do this, split each variable by Had_Affair (1 = affair, 0 = none) and draw a count plot per variable. plt.subplots returns the axes, and each one is passed as the ax argument of the plot it should hold.

affair.ipynb


import matplotlib.pyplot as plt
import seaborn as sns

# One count plot per variable on a 3x3 grid, colored by Had_Affair
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(10, 8))

sns.countplot(x='age', hue='Had_Affair', data=df, ax=axes[0, 0])
sns.countplot(x='yrs_married', hue='Had_Affair', data=df, ax=axes[0, 1])
sns.countplot(x='children', hue='Had_Affair', data=df, ax=axes[0, 2])
sns.countplot(x='rate_marriage', hue='Had_Affair', data=df, ax=axes[1, 0])
sns.countplot(x='religious', hue='Had_Affair', data=df, ax=axes[1, 1])
sns.countplot(x='educ', hue='Had_Affair', data=df, ax=axes[1, 2])
sns.countplot(x='occupation', hue='Had_Affair', data=df, ax=axes[2, 0])
sns.countplot(x='occupation_husb', hue='Had_Affair', data=df, ax=axes[2, 1])
plt.tight_layout()


Now that everything is visible at once, we can interpret the data. Basically, you should focus on the variables whose peaks differ between the **affair group and the no-affair group**.
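
A rough numeric counterpart to eyeballing the histograms (again my own addition, using standard pandas calls) is to compare per-group means:


# Mean of every variable within each group; large gaps suggest useful predictors
print(df.groupby('Had_Affair').mean())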

Data preprocessing

Introducing dummy variables

Now we preprocess the data to build the prediction model. In this Affairs dataset, the categorical variables are occupation and occupation_husb (the husband's occupation). For these we introduce dummy variables, encoding each category as 0/1.

The implementation is as follows.

affair.ipynb


import pandas as pd

# One-hot encode the two categorical columns (six occupation categories each)
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])
occ_dummies.columns = ['occ1', 'occ2', 'occ3', 'occ4', 'occ5', 'occ6']
hus_occ_dummies.columns = ['hocc1', 'hocc2', 'hocc3', 'hocc4', 'hocc5', 'hocc6']
occ_dummies


The columns were split as expected.

Deleting and concatenating columns

Next, remove the columns we no longer need and concatenate the ones we do. Keep Had_Affair as the target Y, then delete the occupation, occupation_husb, and Had_Affair columns. I also drop the raw affairs count here, since the target was derived directly from it and would otherwise leak the answer into the features.

affair.ipynb


Y = df['Had_Affair']  # keep the binary target before dropping it below
X = df.drop(['occupation', 'occupation_husb', 'Had_Affair', 'affairs'], axis=1)

Next, concatenate the two sets of dummy variables.

affair.ipynb


# Put the wife's and husband's occupation dummies side by side
dummies = pd.concat([occ_dummies, hus_occ_dummies], axis=1)

Finally, combine the dummy variables with the remaining data.

affair.ipynb


XX = pd.concat([X, dummies], axis=1)

About multicollinearity

Next, let us consider multicollinearity. This problem tends to appear as the number of explanatory variables grows: when some explanatory variables are strongly correlated with one another, the phenomenon is called **multicollinearity**. When it is severe, the accuracy of the regression equation can become extremely poor and the analysis results unstable.

For example, in a model that predicts house prices, the "number of rooms" and the "floor area" can be expected to correlate strongly. In such cases you can avoid multicollinearity by excluding one of the variables.
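
As a side note not in the original post, one way to check for this numerically is the variance inflation factor (VIF) from statsmodels. A minimal sketch, run on XX while it still contains all twelve dummy columns, where the perfect collinearity should show up as huge (or infinite) VIFs:


import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF per feature: 1/(1 - R^2) of regressing that feature on all the others.
# Values far above ~10 (or inf, for exact collinearity) are a warning sign.
X_vif = add_constant(XX)  # VIF calculations expect an intercept column
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)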

This time, I build the model after excluding occ1 and hocc1 (= student) from the occupation dummies. Because the six dummies in each group always sum to 1, keeping all of them makes any one column an exact linear combination of the others, which is precisely the multicollinearity we want to avoid.

affair.ipynb


# Drop one dummy level from each group to break the exact linear dependence
XX = XX.drop(['occ1', 'hocc1'], axis=1)


Predicting with logistic regression

Now let us fit and evaluate the model. This time I make a simple prediction with scikit-learn's logistic regression. The model is first trained only on the training data, and then used to predict on the test data.

affair.ipynb


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Hold out a test set, train on the rest, then score the predictions
X_train, X_test, Y_train, Y_test = train_test_split(XX, Y)
model2 = LogisticRegression()
model2.fit(X_train, Y_train)
class_predict = model2.predict(X_test)
print(metrics.accuracy_score(Y_test, class_predict))

0.707286432160804

The accuracy came out to about 70%. Now, what happens if the columns that were deleted earlier to avoid multicollinearity are left in place (in other words, the data is used as it is)?

affair.ipynb


# Assumption: X2 is the untreated feature matrix, i.e. the dummies
# concatenated without dropping occ1 and hocc1
X2 = pd.concat([X, dummies], axis=1)

X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y)
model3 = LogisticRegression()
model3.fit(X2_train, Y2_train)
class_predict2 = model3.predict(X2_test)
print(metrics.accuracy_score(Y2_test, class_predict2))

0.9748743718592965

The accuracy was as high as 97%. **In this case, it turned out better to leave the data as it was, since it did not cause a multicollinearity problem.**

In other words, whether multicollinearity needs to be dealt with seems to be something you have to check empirically: run the calculation once with all the data included and once with the correlated columns removed, and compare.
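
If you want that comparison to depend less on a single random train/test split, one option (not in the original notebook) is to average accuracy over several folds with scikit-learn's cross_val_score:


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 folds for each feature set
for name, features in [('occ1/hocc1 dropped', XX), ('all columns kept', X2)]:
    scores = cross_val_score(LogisticRegression(max_iter=1000), features, Y, cv=5)
    print(name, scores.mean())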

In closing

I interpreted the data with pandas and matplotlib and performed preprocessing with multicollinearity in mind. Because this is a tutorial-style dataset, everything went smoothly, but it was still good practice with pandas, including drawing graphs and combining data. Also, since running logistic regression itself is very simple, it was convenient to be able to get results without knowing what happens inside.

The full program is here. https://github.com/Fumio-eisan/affairs_20200412
