Table of contents: Procedure for working through a regression problem

① Handle missing values (repeat ① and ②)
② Understand the characteristics of the data following the basic procedure
  Basic statistics
  Visualization
③ Feature generation
④ Build a multiple regression model and run predictions
⑤ Evaluate the prediction results (RMSE)
⑥ Improve the model's prediction accuracy (return to ⑤)

Handling of missing values

Important functions

DataFrame.isnull().sum()
DataFrame.dropna()
DataFrame.fillna()

Data manipulation

# Extract the rows where beds is NaN (like SELECT * ... WHERE in SQL)
beds_nan_data = DataFrame[DataFrame['beds'].isnull()]

# Display the number of rows for each bed_type
print(beds_nan_data['bed_type'].value_counts())

# Delete the rows that have missing values in beds, bedrooms, or bathrooms
data = data.dropna(subset=['beds','bedrooms','bathrooms'])

# Assign the mean of review_scores_rating to the variable mean_val
mean_val = data['review_scores_rating'].mean()

# Fill the missing values of review_scores_rating with the mean
data['review_scores_rating'] = data['review_scores_rating'].fillna(mean_val)
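
To see the three functions above working together, here is a minimal sketch on a tiny made-up table. The values and the names `toy` and `cleaned` are assumptions for illustration only; just the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the text above.
toy = pd.DataFrame({
    'beds': [1.0, np.nan, 2.0, 3.0],
    'bedrooms': [1.0, 1.0, np.nan, 2.0],
    'bathrooms': [1.0, 1.0, 1.0, 2.0],
    'review_scores_rating': [90.0, 80.0, np.nan, np.nan],
})

# Count the missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Drop the rows that have a missing value in any of the listed columns
cleaned = toy.dropna(subset=['beds', 'bedrooms', 'bathrooms']).copy()

# Fill the remaining missing ratings with the mean of the observed ratings
mean_val = cleaned['review_scores_rating'].mean()
cleaned['review_scores_rating'] = cleaned['review_scores_rating'].fillna(mean_val)
print(cleaned)
```

Note that the mean is computed after the rows are dropped, so it reflects only the rows that remain. Whether to compute it before or after dropping is a judgment call that depends on the data.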

Data visualization

Basic statistics

# Show the distinct values
print(DataFrame['Column name'].unique())
# Show the number of distinct values
print(DataFrame['Column name'].nunique())
# Important: display the count of each value in descending order
print(DataFrame['Column name'].value_counts())

# Display statistics for 2,4,7,7,10,12,13,17,20,22,30
# describe() shows the mean, median (50%), maximum, etc.; mode() shows the mode
s = pd.Series([2,4,7,7,10,12,13,17,20,22,30])
print(s.describe())
print(s.mode())

Histogram, bar graph

#Extract only the y column from the variable data and assign it to the variable y
y = data['y']
#Histogram visualization
y.plot.hist(title='Accommodation price')
#Functions required to display visualization results
plt.show()

# Plot the result of value_counts() as a bar graph
v = data['Column name'].value_counts()
v.plot.bar()
plt.show()

Remove outliers

# We want to keep the rows where y is 10 or more (and delete those below 10), so the condition is data['y'] >= 10

# Check the number of rows before deletion
before_rows = data.shape[0]
print(before_rows)

#Delete
data = data[data['y'] >= 10]

# Check the number of rows after deletion
after_rows = data.shape[0]
print(after_rows)

Drawing a box plot

#Drawing a box plot
sns.boxplot(data=DataFrame, x='Horizontal column name', y='Vertical column name')

#Limit the display range
plt.ylim(0,600)

plt.show()

Drawing a scatter plot

DataFrame.plot.scatter(x='Horizontal column name', y='Vertical column name')
plt.show()

Extract under multiple conditions

# From data, extract the rows where bathrooms is 0.0 and y is 1000 dollars or more, and display them.
# For multiple conditions, use DataFrame[(condition) & (condition)]
data_tmp = data[(data['bathrooms']==0.0) & (data['y'] >= 1000)]
print(data_tmp)
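
As a small illustration of how conditions combine, the sketch below applies AND, OR, and NOT to a made-up table. The values and the variable names are assumptions for illustration; only the column names come from the example above.

```python
import pandas as pd

# Hypothetical listings; the values are invented for illustration.
toy = pd.DataFrame({
    'bathrooms': [0.0, 1.0, 0.0, 2.0],
    'y': [1200, 1500, 80, 5],
})

# AND: both conditions must hold (each condition needs its own parentheses)
both = toy[(toy['bathrooms'] == 0.0) & (toy['y'] >= 1000)]

# OR: at least one condition holds
either = toy[(toy['bathrooms'] == 0.0) | (toy['y'] >= 1000)]

# NOT: invert a condition with ~
not_cheap = toy[~(toy['y'] < 10)]

print(len(both), len(either), len(not_cheap))
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `>=` in Python.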

Feature processing

Functions

**A user-defined function has to be defined first and then called, whereas a lambda function lets you write the definition and the call on a single line.**

# Read property.csv and assign it to the variable mydata.
mydata = pd.read_csv('property.csv')

# cleaning_fee holds 't' or 'f', so convert these to 1 and 0.
def change_tf(x):
    if x == 't':
        return 1
    elif x == 'f':
        return 0
mydata['cleaning_fee'] = mydata['cleaning_fee'].apply(change_tf)
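
The lambda form mentioned above is not shown in the snippet, so here is a minimal sketch comparing it with the named function. The `flags` series is a hypothetical stand-in for the `cleaning_fee` column, and the `map` variant is an extra option rather than something from the original exercise.

```python
import pandas as pd

# Hypothetical column of 't' / 'f' flags
flags = pd.Series(['t', 'f', 't'])

# Named function: defined first, then passed to apply
def change_tf(x):
    if x == 't':
        return 1
    elif x == 'f':
        return 0

by_def = flags.apply(change_tf)

# Lambda: definition and use on a single line
by_lambda = flags.apply(lambda x: 1 if x == 't' else 0)

# For a plain value-to-value lookup, map with a dict is another option
by_map = flags.map({'t': 1, 'f': 0})

print(by_def.tolist(), by_lambda.tolist(), by_map.tolist())
```

The three versions agree on 't' and 'f' but differ on anything else: this lambda turns every other value into 0, while `change_tf` and the dict lookup leave it as a missing value.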

Evaluation function

How to calculate RMSE

#Library import
import numpy as np
from sklearn.metrics import mean_squared_error as MSE

#Variable preparation
actual = [3,4,6,2,4,6,1]
pred = [4,2,6,5,3,2,3]

#Calculation of MSE
mse = MSE(actual,pred)
print(mse)

#RMSE calculation
rmse = np.sqrt(mse)
print(rmse)

**Older versions of scikit-learn do not provide a dedicated RMSE function. They implement MSE (Mean Squared Error), which is the value before the square root is taken, so RMSE is obtained by taking the square root of the MSE yourself. (scikit-learn 1.4 and later also provide sklearn.metrics.root_mean_squared_error.)**
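
To make the relationship between MSE and RMSE concrete, the following sketch repeats the calculation for the same `actual` and `pred` lists using only the standard library. It is a hand-rolled illustration of the formula, not a replacement for the scikit-learn call.

```python
import math

actual = [3, 4, 6, 2, 4, 6, 1]
pred = [4, 2, 6, 5, 3, 2, 3]

# Squared error of each prediction
squared_errors = [(a - p) ** 2 for a, p in zip(actual, pred)]

# MSE is the mean of the squared errors
mse = sum(squared_errors) / len(squared_errors)

# RMSE is the square root of the MSE
rmse = math.sqrt(mse)

print(mse)   # 5.0
print(rmse)
```

The squared errors are 1, 4, 0, 9, 1, 16, and 4, which sum to 35, so the MSE is 35 / 7 = 5.0 and the RMSE is the square root of 5.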

Linear regression model

#Let's import Linear Regression.
from sklearn.linear_model import LinearRegression

#Model preparation
lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

# Display the partial regression coefficients
print(pd.DataFrame(lr.coef_, index=X_train.columns))

# Display the intercept
print(lr.intercept_)

# Predict on X_train
y_pred_train = lr.predict(X_train)

# Check the predictions
print(y_pred_train)

# Round one prediction to an integer and display it
chk = int(round(y_pred_train[1]))
print(chk)
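
Since `X_train` and `y_train` are not defined at this point, here is a self-contained sketch of the same steps on made-up data. The data is generated from a known formula (an assumption for illustration), so the fitted coefficients and intercept can be checked against it. The names `X_toy`, `y_toy`, and `lr_toy` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data generated from
# y = 20 * accommodates + 30 * bathrooms + 50
X_toy = pd.DataFrame({
    'accommodates': [1, 2, 3, 4, 5, 6],
    'bathrooms':    [1, 1, 2, 1, 2, 3],
})
y_toy = 20 * X_toy['accommodates'] + 30 * X_toy['bathrooms'] + 50

lr_toy = LinearRegression()
lr_toy.fit(X_toy, y_toy)

# Partial regression coefficients, one per explanatory variable
coefs = pd.Series(lr_toy.coef_, index=X_toy.columns)
print(coefs)

# Intercept
print(lr_toy.intercept_)

# Predict for a new row and round the result to an integer
new_row = pd.DataFrame({'accommodates': [2], 'bathrooms': [2]})
pred_new = lr_toy.predict(new_row)
print(int(round(pred_new[0])))
```

Because the toy data contains no noise, the coefficients should come out close to 20 and 30 and the intercept close to 50. On real data they are estimates and will not match any clean formula.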

Summary: Explanatory variables, data partitioning, modeling, model evaluation (RMSE)

# Select the explanatory variables, split the data, build the model, and evaluate it (RMSE)
from sklearn.model_selection import train_test_split
select_columns = ['room_type','accommodates','bed_type','bathrooms','cleaning_fee']
dummy_data = pd.get_dummies(data[select_columns],drop_first=True)

X_train,X_test,y_train,y_test = train_test_split(dummy_data, data['y'], random_state = 1234)

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)

# Calculate RMSE for the X_train predictions
rmse_train = np.sqrt(MSE(y_train, y_pred_train))

# Predict using X_test, which was held out for evaluation
y_pred_test = lr.predict(X_test)

# Calculate RMSE for the X_test predictions
rmse_test = np.sqrt(MSE(y_test,y_pred_test))

#View RMSE for training and evaluation data
print(rmse_train)
print(rmse_test)

**Since the RMSE on the test data is about 131, we can see that this model's predictions for the evaluation data deviate from the actual room price by roughly $131 on average.**
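
The summary above relies on `pd.get_dummies(..., drop_first=True)` to turn categorical columns into numeric ones. The sketch below shows what that call produces on a tiny made-up table. The rows are assumptions for illustration; the category names follow the `room_type` values used in this article.

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room'],
    'accommodates': [4, 2, 1, 2],
})

# String columns are expanded into indicator columns (0/1 or True/False,
# depending on the pandas version); numeric columns pass through unchanged.
# drop_first=True drops the first category of each encoded column
# (here 'Entire home/apt'), which then acts as the baseline.
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
print(dummies)
```

Dropping one category avoids a redundant column, since a row with 0 in every remaining `room_type` indicator can only belong to the baseline category.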

Create three models and display the combined linear regression results


select_columns = ['room_type','accommodates','bed_type','bathrooms','cleaning_fee']

data_entire = data[data['room_type'] == 'Entire home/apt']
data_private = data[data['room_type'] == 'Private room']
data_share = data[data['room_type'] == 'Shared room']
dummy_data_entire = pd.get_dummies(data_entire[select_columns], drop_first=True)
dummy_data_private = pd.get_dummies(data_private[select_columns],drop_first=True)
dummy_data_share = pd.get_dummies(data_share[select_columns], drop_first=True)

X_train_e,X_test_e,y_train_e,y_test_e = train_test_split(dummy_data_entire, data_entire['y'], random_state = 1234)
X_train_p,X_test_p,y_train_p,y_test_p = train_test_split(dummy_data_private, data_private['y'], random_state = 1234)
X_train_s,X_test_s,y_train_s,y_test_s = train_test_split(dummy_data_share, data_share['y'], random_state = 1234)

model_e = LinearRegression()
model_e.fit(X_train_e, y_train_e)
pred_e_train = model_e.predict(X_train_e)
pred_e_test = model_e.predict(X_test_e)

model_p = LinearRegression()
model_p.fit(X_train_p, y_train_p)
pred_p_train = model_p.predict(X_train_p)
pred_p_test = model_p.predict(X_test_p)

model_s = LinearRegression()
model_s.fit(X_train_s, y_train_s)
pred_s_train = model_s.predict(X_train_s)
pred_s_test = model_s.predict(X_test_s)
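
The code above stops after the three sets of predictions. One possible way to combine them into a single score, assuming that "combined" means pooling the predictions of all three models before computing RMSE, is sketched below. The small arrays are hypothetical stand-ins so the snippet runs on its own; in the real workflow they would be `y_test_e` / `pred_e_test` and so on from the code above.

```python
import numpy as np

# Hypothetical stand-ins for the three test targets and predictions
y_test_e, pred_e_test = np.array([200.0, 150.0]), np.array([190.0, 170.0])
y_test_p, pred_p_test = np.array([80.0, 60.0]), np.array([70.0, 60.0])
y_test_s, pred_s_test = np.array([30.0]), np.array([40.0])

# Stack the three groups back into one set of actual values and predictions
y_test_all = np.concatenate([y_test_e, y_test_p, y_test_s])
pred_test_all = np.concatenate([pred_e_test, pred_p_test, pred_s_test])

# Overall RMSE across all room types
rmse_test_all = np.sqrt(np.mean((y_test_all - pred_test_all) ** 2))
print(rmse_test_all)
```

Pooling the rows is not the same as averaging the three per-model RMSE values: a room type with more rows contributes more to the pooled score, which makes it comparable to the single-model RMSE from the previous section.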

That's all.

[PYTHON] SIGNATE Quest ③ Accommodation price estimation