[PYTHON] [Logistic regression] Implement holdout validation with statsmodels


In Python, scikit-learn and statsmodels are the main libraries that provide logistic regression models. statsmodels has advantages that scikit-learn lacks, such as automatically reporting significance tests for the coefficients, but it does not provide the holdout method or cross-validation, which are standard ways to evaluate a model. So in this article, let's write code that implements the holdout method for a statsmodels model.

For an implementation of k-fold cross-validation with statsmodels, see the companion article "[Logistic regression] Implement k-validation with stats models".

Importing the libraries


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

Loading and preprocessing the data

For the data, I will use crowdfunding data that I collected myself for my graduation research. The data is on my GitHub page, so please download it to your environment if necessary.


#Read the csv file (replace the string below with the path to cultured.csv)
cultured = pd.read_csv("cultured.path to csv")

#Create the objective variable (0: crowdfunding failure, 1: crowdfunding success)
cultured["achievement"] = cultured["Total amount of support"] // cultured["Target amount"]
cultured["target"] = 0
cultured.loc[cultured['achievement']>=1,'target'] = 1

#Split into the objective variable (y) and the explanatory variables (x)
#Create the constant term with add_constant
y = cultured["target"]
x_pre = cultured[["Target amount","Number of supporters","word count","Number of activity reports"]]
x = sm.add_constant(x_pre)

The goal is to predict whether a crowdfunding project succeeds (y = 1) or fails (y = 0) from four explanatory variables: target amount, number of supporters, number of characters, and number of activity reports. scikit-learn's logistic regression adds the constant term (intercept) automatically, but statsmodels does not, so it is created here with add_constant(). The explanatory variables (x) look like this: [screenshot of the x DataFrame]

Implementation (the main part of this article)


#Holdout method
def hold_out(x, y):
    #Split the data into training data and test data
    #test_size is the fraction of the whole data set used as test data
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    #Fit the model on the training data
    model = sm.Logit(y_train, X_train)
    results = model.fit()
    #Store the predictions for the test data in pred
    #Note that each output is the probability that the objective variable is 1 (here, the probability of success)
    pred = results.predict(X_test)
    #Convert probabilities greater than 0.5 to 1 and all others to 0
    #using a list comprehension
    result = [1 if i > 0.5 else 0 for i in pred]
    #train_test_split shuffles the rows, so reset the index to 0, 1, 2, ...
    y_test_re = y_test.reset_index(drop=True)
    #Initialize count to 0
    #Add 1 to count each time y_test matches the predicted value
    count = 0
    for i in range(len(y_test)):
        if y_test_re[i] == result[i]:
            count += 1
    #The return value is the prediction accuracy
    return count / len(y_test)




Calling hold_out(x, y) returns the prediction accuracy on the test data. It was 0.878 in my environment!
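As an aside, the thresholding and counting steps can also be written without an explicit loop. The following is a sketch of that alternative, not the code used to obtain the result above. The helper name `accuracy_from_probabilities` and the toy values are hypothetical. Because the comparison is done on NumPy arrays, positions are compared directly and the shuffled index does not need to be reset.

```python
import numpy as np
import pandas as pd

def accuracy_from_probabilities(pred, y_true, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels and return the accuracy."""
    # Probabilities above the threshold become 1, all others 0
    pred_label = (np.asarray(pred) > threshold).astype(int)
    # Compare position by position and average the matches
    return float((pred_label == np.asarray(y_true)).mean())

# Hypothetical predictions and true labels with a shuffled index,
# similar to what train_test_split returns
pred_demo = pd.Series([0.9, 0.2, 0.6, 0.4], index=[7, 3, 5, 1])
y_demo = pd.Series([1, 0, 0, 0], index=[7, 3, 5, 1])

acc_demo = accuracy_from_probabilities(pred_demo, y_demo)
print(acc_demo)
```

Here three of the four thresholded predictions match the true labels, so the accuracy is 0.75. Inside hold_out, the same helper could be called with pred and y_test in place of the toy series.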
