[PYTHON] [Logistic regression] Implement k-fold cross-validation with statsmodels

Introduction

In Python, scikit-learn and statsmodels are the main libraries that provide logistic regression models. statsmodels has advantages that scikit-learn lacks, such as automatically testing the significance of each coefficient, but it has no built-in support for the holdout method or cross-validation, which are typical model evaluation methods. So this time, let's write code that implements k-fold cross-validation with statsmodels.

See here for the implementation of the holdout method using statsmodels.

Importing libraries

sample.ipynb


import numpy as np
import pandas as pd
import statsmodels.api as sm

Loading and preprocessing the data

For the data, I will use crowdfunding data that I collected myself for my graduation research. The data is on my GitHub page, so please download it to your environment if necessary.

sample.ipynb


# Read the CSV file
cultured = pd.read_csv("path to cultured.csv")

# Create the objective variable (0: crowdfunding failure, 1: crowdfunding success)
cultured["achievement"] = cultured["Total amount of support"] // cultured["Target amount"]
cultured["target"] = 0
cultured.loc[cultured['achievement']>=1,'target'] = 1

# Split into the objective variable (y) and the explanatory variables (x)
# Create a constant term with add_constant
y = cultured["target"]
x_pre = cultured[["Target amount","Number of supporters","word count","Number of activity reports"]]
x = sm.add_constant(x_pre)

This data is used to predict whether a crowdfunding project succeeds (y = 1) or fails (y = 0) from four explanatory variables: target amount, number of supporters, word count, and number of activity reports. scikit-learn's logistic regression adds the constant term (intercept) automatically, but statsmodels does not, so it is added here with add_constant(). The resulting explanatory variables (x) consist of a const column followed by the four columns above.

Implementation (the main part of this article)

First, create a train_test function that splits the data into training and test sets.

sample.ipynb


# Split the data using the remainder when each row position is divided by the number of folds k
# k: number of folds, r: remainder after dividing by k
def train_test(x, y, k, r):
    # Create an ndarray of integers from 0 to (number of rows of x) - 1
    # and treat it as the row positions
    idx = np.arange(0, x.shape[0])
    # Positions whose remainder after division by k equals r go into idx_test
    idx_test = idx[np.fmod(idx, k) == r]
    idx_train = idx[np.fmod(idx, k) != r]
    # Rows at the positions in idx_test go into x_test (y_test); the rest are training data
    # iloc selects by position, so this works whatever the index labels are
    x_test = x.iloc[idx_test,:]
    x_train = x.iloc[idx_train,:]
    y_test = y.iloc[idx_test]
    y_train = y.iloc[idx_train]
    return x_train, x_test, y_train, y_test
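To make the remainder-based split concrete, here is a small sketch that lists which row positions end up in each test fold. The dataset size (7 rows) and k = 3 are illustrative values chosen for this example, not the ones used in the article.

```python
import numpy as np

k = 3               # number of folds (illustrative value)
idx = np.arange(7)  # row positions 0..6 of a hypothetical 7-row dataset

# For each remainder r, the positions that leave remainder r form the test fold
folds = {r: idx[np.fmod(idx, k) == r].tolist() for r in range(k)}
for r, test_rows in folds.items():
    print(r, test_rows)
# 0 [0, 3, 6]
# 1 [1, 4]
# 2 [2, 5]
```

Every row appears in exactly one test fold. Because the assignment depends only on row order, it may be worth shuffling the rows first if the CSV is sorted in some meaningful way.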

Next, implement k-fold cross-validation using the train_test function.

sample.ipynb


def cross_validation(x,y,k):
    # List that collects the accuracy of each fold
    scores = []
    # Loop over every possible remainder
    # With 5 folds, the possible remainders of dividing a position by 5 are 0, 1, 2, 3, 4
    for r in range(k):
        X_train, X_test, y_train, y_test = train_test(x,y,k,r)
        # Fit the model on the training data
        model = sm.Logit(y_train, X_train)
        results = model.fit()
        # Store the predictions for the test data in pred
        # Note that the output is the probability that the objective variable is 1 (here, the probability of success)
        pred = results.predict(X_test)
        # Convert probabilities greater than 0.5 to 1 and all others to 0
        # using a list comprehension
        result = [1 if i>0.5 else 0 for i in pred]
        # After train_test the index labels are no longer consecutive, so reset them
        y_test_re = y_test.reset_index(drop=True)
        # Counter for correct predictions
        count=0
        # Add 1 to count whenever y_test matches the predicted value
        for i in range(len(X_test)):
            if y_test_re[i] == result[i]:
                count+=1
        # Append the accuracy for this remainder r to scores
        scores.append(count/len(y_test))
    # Return the average of the k accuracies stored in scores
    return sum(scores) / len(scores)
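The thresholding and counting steps inside the loop can also be written with NumPy array operations. The following is only a sketch on hypothetical probabilities and labels for a single fold, not output from the model in this article.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels for one test fold
pred = np.array([0.91, 0.40, 0.55, 0.08, 0.73])
y_true = np.array([1, 0, 0, 0, 1])

# Threshold at 0.5, then take the share of predictions that match the labels
labels = (pred > 0.5).astype(int)
accuracy = np.mean(labels == y_true)

print(labels.tolist(), accuracy)
# [1, 0, 1, 0, 1] 0.8
```

This gives the same accuracy as the explicit counting loop; which form to use is mostly a matter of taste.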

Result

sample.ipynb


cross_validation(x,y,5)

Running 5-fold cross-validation gave an accuracy of 0.8485 in my environment!
