Because of the coronavirus pandemic I have been spending more time at home, so since April this year I have been studying machine learning in earnest. Along the way I have had many opportunities to work on Kaggle and SIGNATE competitions, and I decided to write a Qiita article as one form of output. This time we will tackle SIGNATE's practice competition "Forecasting accommodation prices for private lodging services". The goal is to build a benchmark that can serve as a starting point for deeper analysis and insight. The code for this article is available as a Jupyter Notebook here.
In this task we build a model that predicts the nightly accommodation price of each property, using listing data from Airbnb, a private lodging (home-sharing) service. On Airbnb, property owners set room prices based on factors such as room size and location, but setting a reasonable rate does not seem to be easy.
Import the required libraries and load the training data, test data, and submission file.
#Library import
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm,skew
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
#Data reading
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sub = pd.read_csv('sample_submit.csv',names=('id','pred'))
#Store the number of rows in variables
ntrain = train.shape[0]
ntest = test.shape[0]
#Check the number of data
train.shape, test.shape
#((55583, 29), (18528, 28))
We can confirm that the training data contains 55,583 rows and the test data contains 18,528 rows. The accommodation price **y** that we ultimately want to predict is included only in the training data, so the training data has one more column than the test data. Let's look at the first five rows of the training and test data.
train.head()
test.head()
**amenities** and **name** look like free-text strings. Since the model cannot process raw character data, we need to decide how to handle them. Checking all 29 columns of the training data, the columns containing strings that cannot simply be converted to categories are the following.
| Column name | Description |
|---|---|
| amenities | Amenities |
| description | Property description |
| name | Property name |
| thumbnail_url | Thumbnail image link |
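As a quick way to see which columns hold strings in the first place, the snippet below (a minimal sketch, not part of the original notebook) lists the object-dtype columns together with how many distinct values each has; high-cardinality columns such as the four above are the ones that cannot simply be converted to categories.
# List object-dtype columns and their cardinality (assumes train has been loaded as above)
obj_cols = train.select_dtypes(include='object').columns
print(train[obj_cols].nunique().sort_values(ascending=False))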
Visualize the distribution of the objective variable **y** (accommodation price).
sns.distplot(train['y']);
As stated in the competition overview, the task type is regression. For regression, it helps if the objective variable roughly follows a normal distribution. The plot shows that **y** is far from normal, so we will need to deal with that. Let's also check the current skewness and kurtosis.
#Show skewness and kurtosis
print("skewness: %f" % train['y'].skew())
print("kurtosis: %f" % train['y'].kurt())
#skewness: 4.264338
#kurtosis: 26.030945
The **skewness** was 4.26 and the **kurtosis** was 26.03, so the distribution is heavily skewed. For the main categorical variables, let's also use boxplots to explore their relationship with **y**.
(accommodates)
var = 'accommodates'
data = pd.concat([train['y'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="y", data=data)
fig.axis(ymin=0, ymax=2100);
plt.xticks(rotation=90);
→ The property with the highest median price accommodates 16 people. Hosts (the side renting out the property) seem to tend to set higher prices for properties that can accommodate more guests.
(bathrooms)
var = 'bathrooms'
data = pd.concat([train['y'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="y", data=data)
fig.axis(ymin=0, ymax=2100);
plt.xticks(rotation=90);
(bedrooms)
var = 'bedrooms'
data = pd.concat([train['y'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="y", data=data)
fig.axis(ymin=0, ymax=2100);
plt.xticks(rotation=90);
(beds)
var = 'beds'
data = pd.concat([train['y'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="y", data=data)
fig.axis(ymin=0, ymax=2100);
plt.xticks(rotation=90);
It appears that the larger these values are, the higher the price tends to be. Based on these visualization results, we hypothesize, as one example, that adding a combined **{bath|bed}rooms** feature may improve accuracy; a quick numerical check of this idea is sketched below.
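As a rough sanity check of that hypothesis (a minimal sketch on the raw data, before the missing-value handling and log transform described later, and not part of the original notebook), we can look at how strongly each room-related column, and their sum, correlates with the price:
# Correlation of room-related features (and their combined total) with the price y
room_cols = ['accommodates', 'bathrooms', 'bedrooms', 'beds']
check = train[room_cols].copy()
check['total_rooms'] = check['bathrooms'] + check['bedrooms']
print(check.corrwith(train['y']))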
At the "getting the data" stage, we know that the objective variables do not follow a normal distribution. As a countermeasure, take the logarithm of ** y ** and put it in a pseudo-normal distribution. Let's also make a diagnosis by normal QQ plot by taking the residuals to see if it is approaching a normal distribution.
#Before processing
sns.distplot(train['y'] , fit=norm);
#Get parameters
(mu, sigma) = norm.fit(train['y'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Visualization
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('y distribution')
#Normal Q-Q plot
fig = plt.figure()
res = stats.probplot(train['y'], plot=plt)
plt.show()
#After transformation
#Apply np.log1p (numpy function) to take the logarithm, log(1 + y)
train["y"] = np.log1p(train["y"])
#Check the distribution after application
sns.distplot(train['y'] , fit=norm);
#Get parameters
(mu, sigma) = norm.fit(train['y'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Visualization
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('y distribution')
#Normal Q-Q plot
fig = plt.figure()
res = stats.probplot(train['y'], plot=plt)
plt.show()
Taking the logarithm brings the distribution close to normal. In the normal Q-Q plot, the quantiles of **y** line up along the red 45-degree line, apart from some deviation at the tails, so the transformed objective variable can be treated as approximately normally distributed.
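As a quick numerical confirmation (a small additional check, not in the original notebook), we can recompute the skewness and kurtosis of the transformed **y**; both should now be much closer to 0 than the 4.26 and 26.03 measured before the transform.
# Skewness and kurtosis after the log1p transform
print("skewness: %f" % train['y'].skew())
print("kurtosis: %f" % train['y'].kurt())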
#Extract y column
train_y = train['y']
train_y.shape
#(55583,)
#Combine training data and test data
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['y','id'], axis=1, inplace=True)
print("all_data size : {}".format(all_data.shape))
#all_data size : (74111, 27)
Data analysis always involves handling missing values. Let's deal with the missing values in each column.
#Check the percentage of missing values
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(15)
It turns out that 13 variables contain missing values. Let's decide how to fill or drop them while looking at the original data and the visualization results. This time we will drop **thumbnail_url** and **zipcode**.
#Convert the date-related variables to floating point and fill missing values with 0
for c in ('first_review','last_review','host_since'):
    all_data[c] = pd.to_datetime(all_data[c])
    all_data[c] = pd.DatetimeIndex(all_data[c])
    all_data[c] = np.log(all_data[c].values.astype(np.float64))
    all_data[c] = all_data[c].fillna(0)
#Fill in missing values with 1
for c in ('bathrooms','beds','bedrooms'):
    all_data[c] = all_data[c].fillna(1)
#Fill in missing values with None
for c in ('host_response_rate','neighbourhood','host_identity_verified','host_has_profile_pic'):
    all_data[c] = all_data[c].fillna('None')
#Fill with median
all_data['review_scores_rating'].fillna(all_data['review_scores_rating'].median(),inplace=True)
#Drop unused columns
all_data = all_data.drop(['thumbnail_url','zipcode'],axis=1)
The missing values have now been handled. Let's check the data types here.
all_data.dtypes
There are quite a few object-type columns. The model cannot be trained on these as-is, so we use a **LabelEncoder**.
#Label encoding
cols = ('bed_type','cancellation_policy','city','cleaning_fee','host_identity_verified','host_has_profile_pic','host_response_rate','instant_bookable','property_type','room_type','neighbourhood')
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
#Check data type
all_data.dtypes
Almost all columns have now been converted to numeric types.
Many competitions at Kaggle and SIGNATE deal with text data. Because it handles natural language such as Japanese and English, this area is called natural language processing (NLP) and is established as a field of machine learning. Compared with tabular data, there is little difference at the training/prediction stage, since competition tasks still fit within the supervised learning framework. Preprocessing, on the other hand, takes many forms, such as stemming and word vectorization (a small vectorization sketch is shown after the histogram below). This time we use a simple approach that only counts the number of characters in each text field.
#Count the number of characters in the target column
for c in ('amenities','description','name'):
    all_data[c] = all_data[c].apply(lambda x: sum(len(word) for word in str(x).split(" ")))
all_data.dtypes
You can see that all of them are now numeric. As an example, let's display a histogram of **description**.
#Histogram of description
plt.hist(all_data['description'],alpha=0.5)
plt.xlabel('description')
plt.ylabel('count')
plt.show()
You can see that the overwhelming majority of properties have a description of 800 characters or more.
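As an aside, the word vectorization mentioned above is a common alternative to simple character counts. Below is a minimal TF-IDF sketch with scikit-learn (not used in this benchmark); it works on the raw name text from the original train DataFrame, since all_data['name'] has already been replaced by character counts, and the parameter values are illustrative only.
# Minimal TF-IDF sketch for the raw 'name' text (illustrative, not part of the benchmark)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
name_tfidf = vectorizer.fit_transform(train['name'].fillna('').astype(str))
print(name_tfidf.shape)  # (rows, up to 100 TF-IDF features)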
Finally, let's add the combined **{bath|bed}rooms** feature hypothesized during the EDA.
#Create new features
#Add up the number of bathrooms and bedrooms
all_data['total_rooms'] = all_data['bathrooms'] + all_data['bedrooms']
From training to prediction we will use LightGBM, a GBDT (gradient boosted decision tree) implementation that has been used very frequently in recent competitions.
#Split back into training data and test data
train = all_data[:ntrain]
test = all_data[ntrain:]
#Learning with LGBM Regressor
model = lgb.LGBMRegressor(num_leaves=100,learning_rate=0.05,n_estimators=1000)
model.fit(train,train_y)
#Predict on the test data with the trained model and invert the log1p transform with expm1
pred = np.expm1(model.predict(test))
Check which variables contribute to the model from feature_importances_.
#Visualize variable importance
ranking = np.argsort(-model.feature_importances_)
f, ax = plt.subplots(figsize=(11, 9))
sns.barplot(x=model.feature_importances_[ranking],y=train.columns.values[ranking], orient='h')
ax.set_xlabel('feature importance')
plt.tight_layout()
plt.show()
The two most important features were latitude and longitude. Finally, write the predictions to a CSV file and submit.
#Submission
sub['pred'] = pred
sub.to_csv('sub.csv',index=False,header=None)
With this amount of preprocessing and training, the submission ranked 44th out of 163 (as of October 6, 2020). That is a decent result, but there is room for improvement in every respect (using the dropped features, natural language processing, model selection, cross-validation, parameter tuning, ensembling, and so on). Even better results can be obtained with such refinements; a cross-validation sketch is shown below as one example, and I hope this article is helpful to you.
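As one example of the improvements listed above, the following is a minimal sketch of K-fold cross-validation with the same LGBMRegressor settings (the fold count, shuffling, and RMSE metric are illustrative assumptions, not necessarily the competition's evaluation setup):
# K-fold cross-validation sketch for the LightGBM model (illustrative)
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for tr_idx, va_idx in kf.split(train):
    X_tr, X_va = train.iloc[tr_idx], train.iloc[va_idx]
    y_tr, y_va = train_y.iloc[tr_idx], train_y.iloc[va_idx]
    fold_model = lgb.LGBMRegressor(num_leaves=100, learning_rate=0.05, n_estimators=1000)
    fold_model.fit(X_tr, y_tr)
    va_pred = fold_model.predict(X_va)
    scores.append(mean_squared_error(y_va, va_pred) ** 0.5)  # RMSE on log1p(y)
print('CV RMSE (log scale): %.4f' % np.mean(scores))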
・ Kaggle "House Prices: Advanced Regression Techniques" published notebook "Stacked Regressions: Top 4% on LeaderBoard"