If you use scikit-learn's OneHotEncoder as is, one dummy variable is created for every level of the categorical variable. This causes multicollinearity (the dummy variable trap) in linear regression, so we want to reduce the dummies to the number of levels minus one. I found out how to do it, so I am making a note of it here.
If you set the drop option of OneHotEncoder to "first", the dummy variable for the first category is dropped.
Here, let's extract the column that stores the following categorical values and one-hot encode it.
[['D']
['D']
['D']
['T']
['T']
['T']
['N']
['N']
['N']]
The source is as follows. Note that when you set the drop option to "first", OneHotEncoder complains with an error unless handle_unknown='error' is set.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np


def main():
    X = np.array([
        [1, "D"],
        [3, "D"],
        [5, "D"],
        [2, "T"],
        [4, "T"],
        [6, "T"],
        [1, "N"],
        [8, "N"],
        [2, "N"],
    ])
    y = np.array([2, 6, 10, 6, 12, 18, 1, 8, 2])

    # Take out the second column (the categorical variable)
    category = X[:, [1]]
    print(category)

    encoder = OneHotEncoder(handle_unknown='error', drop='first')
    encoder.fit(category)
    result = encoder.transform(category)
    print(result.toarray())


if __name__ == "__main__":
    main()
Output when drop='first' is not specified:
[[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]]
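As a quick sanity check (a minimal sketch I added, not part of the original source), you can confirm why keeping all three dummies is a problem: together with the intercept column that linear regression adds, the design matrix becomes rank-deficient, which is exactly the multicollinearity mentioned at the top.

import numpy as np

# The dummy matrix printed above (columns in category order D, N, T).
dummies = np.array([
    [1., 0., 0.],
    [1., 0., 0.],
    [1., 0., 0.],
    [0., 0., 1.],
    [0., 0., 1.],
    [0., 0., 1.],
    [0., 1., 0.],
    [0., 1., 0.],
    [0., 1., 0.],
])

# Linear regression adds an intercept; the three dummy columns sum to it.
design = np.hstack([np.ones((9, 1)), dummies])

# 4 columns but rank 3 -> one exact linear dependence (the dummy trap).
print(np.linalg.matrix_rank(design))  # 3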
Output when drop='first' is specified:
[[0. 0.]
[0. 0.]
[0. 0.]
[0. 1.]
[0. 1.]
[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]]
Sure enough, the column for the first category ("D") is gone, leaving levels - 1 dummy variables. Now we can call LinearRegression's fit method without any hesitation.
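To round this off, here is a minimal sketch (my own addition, using the same data as above) that stacks the numeric first column of X with the two remaining dummy columns and actually calls LinearRegression's fit method:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder


def main():
    X = np.array([
        [1, "D"], [3, "D"], [5, "D"],
        [2, "T"], [4, "T"], [6, "T"],
        [1, "N"], [8, "N"], [2, "N"],
    ])
    y = np.array([2, 6, 10, 6, 12, 18, 1, 8, 2])

    # n_levels - 1 dummy columns for the categorical column.
    encoder = OneHotEncoder(handle_unknown='error', drop='first')
    dummies = encoder.fit_transform(X[:, [1]]).toarray()

    # Stack the numeric column and the dummies into one feature matrix.
    features = np.hstack([X[:, [0]].astype(float), dummies])

    model = LinearRegression()
    model.fit(features, y)
    print(model.coef_, model.intercept_)


if __name__ == "__main__":
    main()

The point is simply that the feature matrix no longer contains exactly collinear columns, so the regression coefficients are identifiable.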