An intuitive understanding of residual plots requires understanding what the residuals mean. For the theoretical background of the residual plot, refer to the "FWL Theorem" (Frisch-Waugh-Lovell theorem). In this article, I will briefly explain the meaning of the residuals so that the residual plot makes sense, and then show how to write the Python code.
Consider a linear regression analysis with two explanatory variables. $ Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i $
Once you understand what the error term means, you will see why the straight line in the residual plot, whose slope is the coefficient on $ X_{2i} $, can be obtained by a two-step regression analysis.
The error term captures the part of the explained variable that the explanatory variables cannot explain. That is, it holds all the information about the explained variable other than what the explanatory variables account for. Consider a simple regression of Y on X1. $ Y_i = \pi_1 + \pi_2 X_{1i} + \eta_i $ In this model, $ \eta_i $ holds all the information about Y that $ X_{1} $ cannot explain. Next, consider a regression of X2 on X1. $ X_{2i} = \theta_1 + \theta_2 X_{1i} + \zeta_i $ In this model, $ \zeta_i $ holds all the information about X2 that $ X_{1} $ cannot explain.
Can you guess what to do next? $ \eta $ still contains the information that X2 carries about Y, but none of the information in X1. Likewise, $ \zeta $ contains the information in X2, including what it carries about Y, but none of the information in X1. Therefore, the residual regression described by the FWL theorem regresses $ \eta_i $ on $ \zeta_i $, and the slope of that regression equals the coefficient on $ X_{2i} $ in the two-variable model.
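The claim can be checked numerically. The following is a minimal sketch on synthetic data, not part of the original analysis: the helper `ols`, the sample size, and the true coefficients (1.0, 2.0, -3.0) are assumptions chosen only for illustration. It compares the coefficient on X2 from the two-variable regression with the slope from regressing $ \eta $ on $ \zeta $.

```python
import numpy as np

def ols(y, X):
    """Return the least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic data (illustrative assumption): X2 is correlated with X1
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([const, x1])  # constant and X1

# Full regression of Y on a constant, X1 and X2
beta_full = ols(y, np.column_stack([const, x1, x2]))

# Step 1: residuals of Y and of X2 after regressing each on X1
eta = y - Z @ ols(y, Z)
zeta = x2 - Z @ ols(x2, Z)

# Step 2: regress eta on zeta (both have mean zero, so no constant is needed)
beta_fwl = ols(eta, zeta[:, None])

print("coefficient on X2, full regression:", beta_full[2])
print("slope of eta on zeta:              ", beta_fwl[0])
```

The two printed numbers should agree up to floating-point error, which is what the FWL theorem states; both are only estimates of the value used to generate the data.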
The following code uses statsmodels and goes as far as showing the residual plot.
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
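# Read the data (replace the file name and column names with your own)
data = pd.read_csv("___.csv")
Y = data.loc[:, "name"]        # explained variable
X1 = data.loc[:, ["names1"]]   # explanatory variable to partial out
X2 = data.loc[:, "names2"]     # explanatory variable of interest
# sm.OLS does not add an intercept by itself, so add a constant column
X1 = sm.add_constant(X1)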
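# Step 1a: regress Y on X1 and keep the fitted values
model1 = sm.OLS(Y, X1)
result1 = model1.fit()
Y_hat = result1.predict(X1)
# Step 1b: regress X2 on X1 and keep the fitted values
model2 = sm.OLS(X2, X1)
result2 = model2.fit()
X2_hat = result2.predict(X1)
# Residuals: the parts of Y and X2 that X1 cannot explain
e1 = Y - Y_hat      # eta
e2 = X2 - X2_hat    # zeta
# Residual plot: zeta on the horizontal axis, eta on the vertical axis
plt.plot(e2, e1, linestyle="None", marker=".")
plt.show()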
You can continue the analysis until the residual plot no longer shows any correlation. Because the residuals are plotted, looking at the plot may be a more reliable way to judge whether a correlation exists than looking only at a test statistic.
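If a number to set beside the plot is useful, the correlation between the two residual series can be computed directly. This is a rough sketch on synthetic data: the helper `residualize` and the variables `x2` (which enters Y) and `x3` (which does not) are hypothetical and exist only for illustration.

```python
import numpy as np

def residualize(v, x1):
    """Residuals of v after a simple regression on x1 with a constant."""
    Z = np.column_stack([np.ones(len(x1)), x1])
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

# Synthetic data (illustrative assumption): x2 affects y, x3 does not
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Part of Y that X1 cannot explain
eta = residualize(y, x1)

# Correlation of eta with the residualized candidate variables
corr_x2 = np.corrcoef(eta, residualize(x2, x1))[0, 1]
corr_x3 = np.corrcoef(eta, residualize(x3, x1))[0, 1]

print("residual correlation with x2:", corr_x2)
print("residual correlation with x3:", corr_x3)
```

In this setup the residual plot for `x2` would show a clear upward pattern, while the one for `x3` would look like an unstructured cloud; the correlation coefficient only summarizes the linear part of what the plot shows.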