| | Effective P | No effect 1-P | Odds P/(1-P) |
|---|---|---|---|
| Chemical A | 0.2 | 0.8 | 0.250 |
| Chemical B | 0.05 | 0.95 | 0.053 |
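As a quick check, the odds in the table follow directly from the definition odds = P / (1 − P); the snippet below reproduces them and also computes the odds ratio between the two chemicals (the probabilities are the table values, not new data):

```python
# Odds = P / (1 - P), reproducing the table above
p_a, p_b = 0.2, 0.05
odds_a = p_a / (1 - p_a)          # 0.250
odds_b = p_b / (1 - p_b)          # approximately 0.053
print(round(odds_a, 3), round(odds_b, 3))

# Odds ratio: how many times larger the odds of Chemical A are than those of Chemical B
print(round(odds_a / odds_b, 3))  # about 4.75
```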
To consider a concrete example, let's assume a model that predicts whether a student passes or fails a test (1 if pass, 0 if fail) from the number of study hours.
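In this model the pass probability $p$ is linked to study hours through the logistic function, so the log odds are linear in the number of study hours:

$$
\log \frac{p}{1 - p} = \beta_0 + \beta_1 \times \text{hours}
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{hours})}}
$$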
```python
# Libraries for numerical computation
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

# Libraries for drawing graphs
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# Libraries for estimating statistical models
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Set the number of digits to display
%precision 3

# Location of the data
url = 'https://raw.githubusercontent.com/yumi-ito/sample_data/master/6-3-1-logistic-regression.csv'

# Read the data
df = pd.read_csv(url)

# Output the first 5 rows of the data
df.head()
```

```python
# Output basic statistics of the data
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))
```

```python
# Draw a bar chart
sns.set()
sns.barplot(x="hours", y="result", data=df, palette='summer_r')
```

sns.set() applies seaborn's default style to the plot area. Alternatively, sns.set(style="whitegrid") draws gray grid lines on a white background. seaborn.barplot() takes the x column name, the y column name, the DataFrame, and a color palette name; passing ci=None as an additional argument hides the error bars.
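A minimal sketch of the styling variant mentioned above, assuming the same df; the style and ci values are just the options named in the text, not part of the original code:

```python
# White background with gray grid lines, error bars hidden
sns.set(style="whitegrid")
sns.barplot(x="hours", y="result", data=df, palette='summer_r', ci=None)
# Note: in newer seaborn versions, errorbar=None replaces ci=None
```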
```python
# Pass rate for each number of study hours
pass_rate = df.groupby("hours").mean()
```

groupby() groups the rows that share the same value in the column given as its argument, and .mean() then computes the average of each group.
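For illustration only, here is a minimal self-contained example of groupby().mean() on toy data (not the article's CSV):

```python
import pandas as pd

# Two students studied 1 hour, two studied 2 hours
toy = pd.DataFrame({"hours": [1, 1, 2, 2],
                    "result": [0, 1, 1, 1]})

# Average result (= pass rate) per value of hours: 0.5 and 1.0
print(toy.groupby("hours").mean())
```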
```python
# Estimate the model
mod_glm = smf.glm(formula="result ~ hours",
                  data=df,
                  family=sm.families.Binomial()).fit()
```

The model is estimated with the smf.glm() function (glm is an abbreviation for Generalized Linear Models). formula specifies the structure of the model, with result as the response variable and hours as the explanatory variable. family specifies the probability distribution; since this example uses the binomial distribution, we pass sm.families.Binomial().

```python
# Output a summary of the estimation results
mod_glm.summary()
```
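As a sketch, the same model can be estimated without the formula interface by building the design matrix explicitly (mod_glm_arr is a hypothetical name, not from the original):

```python
# Equivalent estimation via the array interface
X = sm.add_constant(df["hours"])                 # add an intercept column
mod_glm_arr = sm.GLM(df["result"], X,
                     family=sm.families.Binomial()).fit()
print(mod_glm_arr.params)                        # same coefficients as mod_glm
```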

```python
# Draw the regression curve
sns.lmplot(x="hours", y="result",
           data=df,
           logistic=True,
           scatter_kws={"color": "green"},
           line_kws={"color": "black"},
           x_jitter=0.1, y_jitter=0.02)
```

When logistic=True, y is assumed to be a binary variable (a variable that can take only the two values 0 and 1) and statsmodels is used to estimate a logistic regression model. scatter_kws and line_kws are additional keyword arguments passed on to plt.scatter and plt.plot, here used to set the colors of the scatterplot dots and the regression curve. x_jitter and y_jitter scatter the dots slightly up and down, purely for appearance: since pass/fail takes only the values 1 or 0, the dots would otherwise overlap.
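Note that lmplot fits its own logistic model internally. A minimal sketch of drawing the curve from the already-fitted mod_glm instead, assuming the imports above (hours_grid is a hypothetical name):

```python
# Plot the curve predicted by mod_glm over a fine grid of study hours
hours_grid = pd.DataFrame({"hours": np.linspace(0, 9, 100)})
plt.scatter(df["hours"], df["result"], color="green", alpha=0.5)
plt.plot(hours_grid["hours"], mod_glm.predict(hours_grid), color="black")
plt.xlabel("hours")
plt.ylabel("predicted pass probability")
plt.show()
```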
```python
# Create a DataFrame with a column named hours holding the arithmetic sequence 0-9
predicted_value = pd.DataFrame({"hours": np.arange(0, 10, 1)})

# Calculate the predicted pass rate
pred = mod_glm.predict(predicted_value)
```

Calling the predict() method of the fitted model mod_glm calculates the predicted values for the DataFrame predicted_value created above.
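predict() is equivalent to applying the inverse logit to the linear predictor by hand; a sketch using the estimated parameters (manual_pred is a hypothetical name):

```python
# Manual check of predict(): inverse logit of the linear predictor
b0 = mod_glm.params["Intercept"]
b1 = mod_glm.params["hours"]
manual_pred = 1 / (1 + np.exp(-(b0 + b1 * predicted_value["hours"])))
print(np.allclose(manual_pred, pred))  # True
```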
```python
# Get the predicted pass rates for 1 hour and 2 hours of study
pred_1 = pred[1]
pred_2 = pred[2]

# Calculate the odds for each
odds_1 = pred_1 / (1 - pred_1)
odds_2 = pred_2 / (1 - pred_2)

# Calculate the log odds ratio
print("Log odds ratio:", round(np.log(odds_2 / odds_1), 3))

# Get the coefficient of the model
value = mod_glm.params["hours"]
print("Model coefficient:", round(value, 3))
```

np.exp(x) returns $e$, the base of the natural logarithm, raised to the power $x$.

```python
# Take exp of the regression coefficient
exp = np.exp(mod_glm.params["hours"])
print("exp of coefficient:", round(exp, 3))

# Calculate the odds ratio
odds = odds_2 / odds_1
print("Odds ratio:", round(odds, 3))
```
