[PYTHON] Logistic regression

What is logistic regression?

Logistic regression is similar to linear regression, but it is used when the objective variable is binary: for example, whether a person buys a product, whether a batter gets a hit, whether someone moves, whether someone changes jobs, and so on.

Create a prediction model using the logistic function (sigmoid function):

σ(t) = 1 / (1 + e^(−t))

The logistic function takes values between 0 and 1 and increases monotonically, tracing an S-shaped curve. (Figure: plot of the logistic curve.)

The relationship between the objective variable y and the explanatory variable x is as follows: the right-hand side of the linear model y = ax + b goes into the exponent with a minus sign, giving

p = 1 / (1 + e^(−(ax + b)))
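To make the shape concrete, here is a minimal sketch (my addition, not from the original post) that evaluates and plots the logistic function:

{plot_sigmoid.py}


import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-10, 10, 200)   # inputs over a symmetric range
sigma = 1 / (1 + np.exp(-t))    # logistic (sigmoid) function

plt.plot(t, sigma)
plt.xlabel('t')
plt.ylabel('sigma(t)')
plt.title('Logistic function: S-shaped, monotonic, values in (0, 1)')
plt.show()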

Preparing the data

We use the affairs dataset bundled with statsmodels and run the regression with sklearn.

{get_affair_dataset.py}


import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression    # For logistic regression
from sklearn.model_selection import train_test_split   # For the train/test split (the old sklearn.cross_validation module is deprecated)
import statsmodels.api as sm

df = sm.datasets.fair.load_pandas().data  # Load the affairs dataset

Summary of affair data

{describe_affair.py}


df.head()
(Figure: output of df.head(), the first five rows of the data frame.)

- rate_marriage: self-rated happiness of the marriage
- age: age
- yrs_married: years of marriage
- children: number of children
- religious: degree of religiousness
- educ: level of education
- occupation: wife's occupation
- occupation_husb: husband's occupation
- affairs: measure of time spent in extramarital affairs (greater than 0 means the person had an affair)
- Had_Affair: affair flag (1 if affairs > 0; derived from affairs, see the sketch below)
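Note that Had_Affair is not a column of the raw statsmodels dataset; it has to be derived from affairs. The original derivation is not shown, but a minimal sketch would be:

{make_had_affair.py}


# Hypothetical reconstruction: flag 1 if any time was spent in affairs, otherwise 0
df['Had_Affair'] = (df['affairs'] > 0).astype(int)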

First, take a quick look at the relationship between a few of the variables and the presence or absence of an affair.

{easy_display1.py}


# Age vs. presence of an affair
sns.countplot(x='age', data=df.sort_values('age'), hue='Had_Affair', palette='coolwarm')
(Figure: count plot of age, colored by Had_Affair.)

{easy_display2.py}


# Years of marriage vs. presence of an affair
sns.countplot(x='yrs_married', data=df.sort_values('yrs_married'), hue='Had_Affair', palette='coolwarm')
(Figure: count plot of yrs_married, colored by Had_Affair.)

{easy_display3.py}


# Number of children vs. presence of an affair
sns.countplot(x='children', data=df.sort_values('children'), hue='Had_Affair', palette='coolwarm')
(Figure: count plot of children, colored by Had_Affair.)

The affair rate appears to be higher with older age, more years of marriage, and more children.

Try logistic regression

Preprocessing

Before fitting, note that the occupation variables are categorical, so replace them with dummy variables. A categorical variable is one for which the magnitude of the value has no meaning; the codes are just labels.

{change_dummy_value.py}


# Convert to dummy variables with pandas get_dummies
occ_dummies = pd.get_dummies(df.occupation)
hus_occ_dummies = pd.get_dummies(df.occupation_husb)

# Set the column names
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']

occ_dummies.head()
(Figure: output of occ_dummies.head(), one 0/1 column per occupation category.)

As shown above, each occupation value is replaced by 0/1 flags indicating which of occ1 through occ6 applies.

Next, get the explanatory variables.

{get_x.py}


# Build X from the original frame, dropping the raw occupation columns and the target flag
X = df.drop(['occupation', 'occupation_husb', 'Had_Affair'], axis=1)
# Collect the occupation dummy variables into one data frame
dummies = pd.concat([occ_dummies, hus_occ_dummies], axis=1)
# Join the dummy variables onto X
X = pd.concat([X, dummies], axis=1)

X.head()

The data set of explanatory variables so far. (Figure: output of X.head().)

Multicollinearity

When one explanatory variable can be expressed in terms of the other explanatory variables, the variables are said to be multicollinear. For example, here occ1 is uniquely determined by the values of occ2 through occ6: if any of occ2 to occ6 is 1, then occ1 = 0; otherwise occ1 = 1. In that case the inverse matrix needed for the estimation cannot be computed, or even when it can, the reliability of the result is low. To eliminate this dependence, drop occ1 and hocc1, as the quick check below illustrates.
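As a sanity check, here is a short sketch (my addition) showing the linear dependence among the occupation dummies: each row has exactly one occupation, so the six columns always sum to 1.

{check_dummy_dependence.py}


# The six dummies sum to 1 in every row...
print((occ_dummies.sum(axis=1) == 1).all())   # True

# ...so occ1 = 1 - (occ2 + occ3 + occ4 + occ5 + occ6):
# a perfect linear dependence, removed by dropping one column.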

{drop_nonavailable_value.py}


X = X.drop('occ1', axis=1)
X = X.drop('hocc1', axis=1)
# affairs was used to create the objective variable, so exclude it from the explanatory variables too
X = X.drop('affairs', axis=1)

X.head()

Final shape

(Figure: output of X.head() after dropping the columns.)

Run with sklearn

{do_logistic_regression.py}


# Set the objective variable
Y = df.Had_Affair
Y = np.ravel(Y)    # Flatten Y into a one-dimensional array with np.ravel

# Run the logistic regression
log_model = LogisticRegression()  # Create the model instance
log_model.fit(X, Y)               # Fit the model
log_model.score(X, Y)             # Check the accuracy on the training data (about 72.6%)
> 0.7260446120012567

Check the coefficient of each variable

{confirm_coefficient.py}


# The fitted coefficient of each variable is stored in the model's coef_[0]
coeff_df = pd.DataFrame([X.columns, log_model.coef_[0]]).T
coeff_df
(Figure: table of each explanatory variable and its fitted coefficient.)

Variables whose coefficients are large in absolute value have more influence. However, the explanatory variables are not measured in common units, so the coefficients cannot simply be compared side by side. For example, the coefficient of occ5 is about 9 times that of yrs_married, but that alone says little until the scale of yrs_married is taken into account.
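If comparable coefficients are wanted, one common remedy (my addition, not part of the original post) is to standardize the explanatory variables before fitting, so each coefficient is measured per standard deviation of its variable:

{standardized_coefficients.py}


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and standard deviation 1

log_model_std = LogisticRegression()
log_model_std.fit(X_scaled, Y)

# Coefficients are now on a common scale and can be compared directly
coeff_std = pd.DataFrame({'variable': X.columns, 'coef': log_model_std.coef_[0]})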

By the way

As usual, here is how to split the data into train and test sets.

{do_logistic_regression_train_test.py}


# Prepare the train and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

log_model2 = LogisticRegression()
log_model2.fit(X_train, Y_train)           # Fit the model on the training data
class_predict = log_model2.predict(X_test) # Predict the test data

from sklearn import metrics  # For checking prediction accuracy

metrics.accuracy_score(Y_test, class_predict)  # Accuracy check
>0.73115577889447236

You can see that the accuracy is about 73%.
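One caveat (my addition): accuracy should be read against the class balance. The fraction of positive cases tells you the score that the trivial model, which always predicts the majority class, would already achieve:

{check_class_balance.py}


# Fraction of cases flagged with an affair; always predicting 0
# would already score (1 - this fraction)
print(Y.mean())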
