As part of my training, I worked through a past Kaggle competition. This post is a brief summary of what I did.
・ Mercari Price Suggestion Challenge
Using Mercari's product listing information, we predict item prices with Ridge regression.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error
train = pd.read_csv('train.tsv', sep='\t')
test = pd.read_csv('test.tsv', sep='\t')
Check the number of rows and columns in each dataset.
print(train.shape)
print(test.shape)
# (1482535, 8)
# (693359, 7)
Combine the train and test data so that the same feature encoding can be applied to both.
all_data = pd.concat([train, test])
all_data.head()
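As a small illustration of what `pd.concat` does here, the sketch below uses two toy frames whose columns and values are made up for this example. It shows that a column missing from one frame is filled with NaN, which is why `price`, `train_id`, and `test_id` end up as float64 with missing values in the combined data, and that the original index labels are kept rather than reset.

```python
import pandas as pd

# Toy stand-ins for train.tsv / test.tsv (hypothetical values, not the real data)
toy_train = pd.DataFrame({
    "train_id": [0, 1],
    "name": ["shirt", "phone case"],
    "price": [10.0, 52.0],
})
toy_test = pd.DataFrame({
    "test_id": [0],
    "name": ["sneakers"],
})

toy_all = pd.concat([toy_train, toy_test])

# Columns that exist in only one frame are filled with NaN in the other's rows
print(toy_all)

# Integer id columns become float64 because NaN cannot be stored in int64
print(toy_all.dtypes)

# The index is not reset, so label 0 appears twice
print(toy_all.index.tolist())
```

Passing `ignore_index=True` to `pd.concat` would give a fresh 0..n-1 index instead; the article's code works either way because it later slices by row position.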

Check the basic information of the data.
# On pandas older than 1.2, use null_counts=True instead of show_counts=True
all_data.info(show_counts=True)
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175894 entries, 0 to 693358
Data columns (total 9 columns):
brand_name           1247687 non-null object
category_name        2166509 non-null object
item_condition_id    2175894 non-null int64
item_description     2175890 non-null object
name                 2175894 non-null object
price                1482535 non-null float64
shipping             2175894 non-null int64
test_id              693359 non-null float64
train_id             1482535 non-null float64
dtypes: float64(3), int64(2), object(4)
memory usage: 166.0+ MB
'''
Count the number of unique values in each text column (duplicates are counted once).
print(all_data.brand_name.nunique())
print(all_data.category_name.nunique())
print(all_data.name.nunique())
print(all_data.item_description.nunique())
# 5289
# 1310
# 1750617
# 1862037
Because much of the data is text, we convert it to numeric features with bag-of-words (BoW) vectors and TF-IDF.
Binarizing (one-hot encoding) the remaining categorical features as well makes the feature matrix very wide, so everything is stored as sparse matrices (matrices in which most entries are 0), which keep only the nonzero values in memory.
# name
cv = CountVectorizer()
name = cv.fit_transform(all_data.name)
# item_description
all_data['item_description'] = all_data['item_description'].fillna('null')
tv = TfidfVectorizer()
item_description = tv.fit_transform(all_data.item_description)
# category_name
all_data['category_name'] = all_data['category_name'].fillna('null')
lb = LabelBinarizer(sparse_output=True)
category_name = lb.fit_transform(all_data.category_name)
# brand_name
all_data['brand_name'] = all_data['brand_name'].fillna('null')
brand_name = lb.fit_transform(all_data.brand_name)
# item_condition_id, shipping
onehot_cols = ['item_condition_id', 'shipping']
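# columns= forces one-hot encoding; integer columns are otherwise passed through unchanged
onehot_data = csr_matrix(pd.get_dummies(all_data[onehot_cols], columns=onehot_cols, dtype=np.int8).values)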
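To make the three encoders above more concrete, here is a minimal sketch on a toy corpus. The sentences and category labels are invented for illustration; only the scikit-learn classes are the same ones used in the article.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

# Hypothetical item names, not taken from the Mercari data
docs = ["red shirt", "blue shirt", "red red phone case"]

# Bag-of-words: each column is a token, each value is a raw count
cv_toy = CountVectorizer()
bow = cv_toy.fit_transform(docs)  # returns a SciPy sparse matrix
vocab = sorted(cv_toy.vocabulary_, key=cv_toy.vocabulary_.get)
print(vocab)          # ['blue', 'case', 'phone', 'red', 'shirt']
print(bow.toarray())  # third row counts 'red' twice

# TF-IDF: counts are reweighted so tokens that appear in many documents
# count for less, and each row is scaled to unit length by default
tv_toy = TfidfVectorizer()
tfidf = tv_toy.fit_transform(docs)
print(tfidf.toarray().round(2))

# LabelBinarizer: one column per distinct label, returned as a sparse matrix
lb_toy = LabelBinarizer(sparse_output=True)
cats = lb_toy.fit_transform(["Men", "Women", "Kids", "Men"])
print(lb_toy.classes_)
print(cats.toarray())
```

Calling `.toarray()` is only safe on toy data like this; on the full dataset the dense version of these matrices would not fit in memory, which is the reason for keeping them sparse.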
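Finally, stack these feature matrices side by side into a single sparse matrix in CSR format, and take the first rows (the ones that came from the training data) as X.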
X_sparse = hstack((name, item_description, category_name, brand_name, onehot_data)).tocsr()
nrows = train.shape[0]
X = X_sparse[:nrows]
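The stacking and slicing step can be checked on small matrices. The sketch below uses made-up 3-row matrices and a hypothetical `n_train` in place of `nrows`; it shows that `hstack` joins columns while keeping the row order, so the first rows can be sliced off again after converting to CSR.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two toy feature blocks with the same number of rows (hypothetical values)
a = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
b = csr_matrix(np.array([[0], [5], [0]]))

# hstack concatenates columns; tocsr() ensures the result supports row slicing
stacked = hstack((a, b)).tocsr()
print(stacked.shape)  # (3, 3)

# Pretend the first 2 rows came from train and the last row from test
n_train = 2
X_tr = stacked[:n_train]
X_te = stacked[n_train:]
print(X_tr.toarray())
print(X_te.toarray())
```

In the article only the first `nrows` rows are used; the remaining rows, `X_sparse[nrows:]`, would be the features for the competition's test set.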
The values of y (the price) vary widely in scale, which affects the prediction results. Standardization would also work, but here we apply a logarithmic transform.
The transform used is $\log(y + 1)$, so that it is still defined when y is 0.
y = np.log1p(train.price)
y[:5]
'''
0    2.397895
1    3.970292
2    2.397895
3    3.583519
4    3.806662
Name: price, dtype: float64
'''
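As a quick check of the transform, the sketch below applies `np.log1p` to a few made-up prices and undoes it with `np.expm1`. A price of 10 maps to ln(11) ≈ 2.397895, which matches the first value in the output above.

```python
import numpy as np

# Hypothetical prices, including 0 to show that log1p is defined there
prices = np.array([0.0, 10.0, 52.0, 2000.0])

logged = np.log1p(prices)    # log(price + 1)
restored = np.expm1(logged)  # exp(value) - 1, the inverse transform

print(logged)
print(restored)
```

The large price is pulled much closer to the others after the transform, so it has less influence on the squared-error fit.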
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
ridge = Ridge()
ridge.fit(X_train, y_train)
'''
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
'''
y_pred = ridge.predict(X_test)
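Ridge regression is ordinary least squares with an added penalty of `alpha` times the squared size of the coefficients. The following sketch, on synthetic sparse data generated only for this illustration, shows the effect of that penalty: a larger `alpha` shrinks the coefficients. The article itself uses the default `alpha=1.0`.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge

# Synthetic sparse features and targets (assumed data, not the Mercari features)
rng = np.random.RandomState(0)
X_toy = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)
w_true = rng.normal(size=50)
y_toy = X_toy @ w_true + 0.1 * rng.normal(size=200)

# Fit with a weak, default, and strong penalty and compare coefficient sizes
norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    norms[alpha] = np.linalg.norm(model.coef_)
    print(f"alpha={alpha}: coefficient norm = {norms[alpha]:.3f}")
```

Trying a few `alpha` values and comparing the validation score is one possible next step for tuning the model in this article.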
We evaluate with RMSLE (root mean squared logarithmic error), the variant of RMSE that the competition uses as its metric.
Because y was log-transformed before modeling, the predictions must be converted back to prices afterwards. This is done inside the evaluation function.
def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_test), np.expm1(y_pred)))
rmse(y_test, y_pred)
# 0.4745184301527575
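Since `mean_squared_log_error` applies log(1 + x) to both inputs internally, undoing the transform with `expm1` and then computing RMSLE should give the same number as plain RMSE on the log-transformed values. The sketch below checks this on a few made-up prices; the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical true and predicted prices, already on the log1p scale
y_true_log = np.log1p(np.array([10.0, 52.0, 35.0]))
y_pred_log = np.log1p(np.array([12.0, 40.0, 35.0]))

# Route used in the article: back to prices, then RMSLE
via_sklearn = np.sqrt(
    mean_squared_log_error(np.expm1(y_true_log), np.expm1(y_pred_log))
)

# Equivalent shortcut: RMSE computed directly on the log-scale values
direct = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

print(via_sklearn, direct)
```

One practical difference: `mean_squared_log_error` raises an error if any value passed to it is negative, which can happen when a log-scale prediction is below 0, whereas the direct formula has no such restriction.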
From the above, we were able to predict and evaluate prices from Mercari's product information.
I wrote this article with beginners in mind. If you found it helpful, I would appreciate an LGTM.
Thank you for reading.