| | Effective P | No effect 1-P | Odds P/(1-P) |
|---|---|---|---|
| Chemical A | 0.2 | 0.8 | 0.250 |
| Chemical B | 0.05 | 0.95 | 0.053 |
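As a quick check, the odds in the table follow directly from the definition odds = P / (1 − P); the snippet below reproduces them and also computes the odds ratio between the two chemicals (the probabilities are the table values, not new data):

```python
# Odds = P / (1 - P), reproducing the table above
p_a, p_b = 0.2, 0.05
odds_a = p_a / (1 - p_a)          # 0.250
odds_b = p_b / (1 - p_b)          # approximately 0.053
print(round(odds_a, 3), round(odds_b, 3))

# Odds ratio: how many times larger the odds of Chemical A are than those of Chemical B
print(round(odds_a / odds_b, 3))  # about 4.75
```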
To consider a concrete example, let's assume a model that predicts whether a student passes or fails a test (1 if pass, 0 if fail) from the number of study hours.
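In this model the pass probability $p$ is linked to study hours through the logistic function, so the log odds are linear in the number of study hours:

$$
\log \frac{p}{1 - p} = \beta_0 + \beta_1 \times \text{hours}
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{hours})}}
$$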
```python
# Libraries for numerical computation
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

# Libraries for drawing graphs
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# Libraries for estimating statistical models
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Set the number of digits to display
%precision 3

# Location of the data
url = 'https://raw.githubusercontent.com/yumi-ito/sample_data/master/6-3-1-logistic-regression.csv'

# Read the data
df = pd.read_csv(url)

# Output the first 5 rows of the data
df.head()
```

```python
# Output basic statistics of the data
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))
```

```python
# Draw a bar chart
sns.set()
sns.barplot(x="hours", y="result", data=df, palette='summer_r')
```

sns.set() applies seaborn's default style to the plot area. Alternatively, sns.set(style="whitegrid") draws gray grid lines on a white background. seaborn.barplot() takes the x column name, the y column name, the DataFrame, and a color palette name; passing ci=None as an additional argument hides the error bars.
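A minimal sketch of the styling variant mentioned above, assuming the same df; the style and ci values are just the options named in the text, not part of the original code:

```python
# White background with gray grid lines, error bars hidden
sns.set(style="whitegrid")
sns.barplot(x="hours", y="result", data=df, palette='summer_r', ci=None)
# Note: in newer seaborn versions, errorbar=None replaces ci=None
```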
```python
# Pass rate for each number of study hours
pass_rate = df.groupby("hours").mean()
```

groupby() groups the rows that share the same value in the column given as its argument, and .mean() then computes the average of each group.
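For illustration only, here is a minimal self-contained example of groupby().mean() on toy data (not the article's CSV):

```python
import pandas as pd

# Two students studied 1 hour, two studied 2 hours
toy = pd.DataFrame({"hours": [1, 1, 2, 2],
                    "result": [0, 1, 1, 1]})

# Average result (= pass rate) per value of hours: 0.5 and 1.0
print(toy.groupby("hours").mean())
```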
```python
# Estimate the model
mod_glm = smf.glm(formula="result ~ hours",
                  data=df,
                  family=sm.families.Binomial()).fit()
```

The model is estimated with the smf.glm() function (glm is an abbreviation for Generalized Linear Models). formula specifies the structure of the model, with result as the response variable and hours as the explanatory variable. family specifies the probability distribution; since this example uses the binomial distribution, we pass sm.families.Binomial().

```python
# Output a summary of the estimation results
mod_glm.summary()
```
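As a sketch, the same model can be estimated without the formula interface by building the design matrix explicitly (mod_glm_arr is a hypothetical name, not from the original):

```python
# Equivalent estimation via the array interface
X = sm.add_constant(df["hours"])                 # add an intercept column
mod_glm_arr = sm.GLM(df["result"], X,
                     family=sm.families.Binomial()).fit()
print(mod_glm_arr.params)                        # same coefficients as mod_glm
```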

```python
# Draw the regression curve
sns.lmplot(x="hours", y="result",
           data=df,
           logistic=True,
           scatter_kws={"color": "green"},
           line_kws={"color": "black"},
           x_jitter=0.1, y_jitter=0.02)
```

When logistic=True, y is assumed to be a binary variable (a variable that can take only the two values 0 and 1) and statsmodels is used to estimate a logistic regression model. scatter_kws and line_kws are additional keyword arguments passed on to plt.scatter and plt.plot, here used to set the colors of the scatterplot dots and the regression curve. x_jitter and y_jitter scatter the dots slightly up and down, purely for appearance: since pass/fail takes only the values 1 or 0, the dots would otherwise overlap.
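Note that lmplot fits its own logistic model internally. A minimal sketch of drawing the curve from the already-fitted mod_glm instead, assuming the imports above (hours_grid is a hypothetical name):

```python
# Plot the curve predicted by mod_glm over a fine grid of study hours
hours_grid = pd.DataFrame({"hours": np.linspace(0, 9, 100)})
plt.scatter(df["hours"], df["result"], color="green", alpha=0.5)
plt.plot(hours_grid["hours"], mod_glm.predict(hours_grid), color="black")
plt.xlabel("hours")
plt.ylabel("predicted pass probability")
plt.show()
```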
```python
# Create a DataFrame with a column named hours holding the arithmetic sequence 0-9
predicted_value = pd.DataFrame({"hours": np.arange(0, 10, 1)})

# Calculate the predicted pass rate
pred = mod_glm.predict(predicted_value)
```

Calling the predict() method of the fitted model mod_glm calculates the predicted values for the DataFrame predicted_value created above.
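predict() is equivalent to applying the inverse logit to the linear predictor by hand; a sketch using the estimated parameters (manual_pred is a hypothetical name):

```python
# Manual check of predict(): inverse logit of the linear predictor
b0 = mod_glm.params["Intercept"]
b1 = mod_glm.params["hours"]
manual_pred = 1 / (1 + np.exp(-(b0 + b1 * predicted_value["hours"])))
print(np.allclose(manual_pred, pred))  # True
```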
```python
# Get the predicted pass rates for 1 hour and 2 hours of study
pred_1 = pred[1]
pred_2 = pred[2]

# Calculate the odds for each
odds_1 = pred_1 / (1 - pred_1)
odds_2 = pred_2 / (1 - pred_2)

# Calculate the log odds ratio
print("Log odds ratio:", round(np.log(odds_2 / odds_1), 3))

# Get the coefficient of the model
value = mod_glm.params["hours"]
print("Model coefficient:", round(value, 3))
```

np.exp(x) returns $e$, the base of the natural logarithm, raised to the power $x$.

```python
# Take exp of the regression coefficient
exp = np.exp(mod_glm.params["hours"])
print("exp of coefficient:", round(exp, 3))

# Calculate the odds ratio
odds = odds_2 / odds_1
print("Odds ratio:", round(odds, 3))
```
