[PYTHON] [For beginners] Kaggle exercise (Mercari)

As part of my training, I worked through a past Kaggle competition. This article is a brief summary of what I did.

Mercari Price Suggestion Challenge

We will use Ridge regression to predict prices from Mercari's product information.

## 1. Preparing the modules


import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error

## 2. Data preparation

Read the data.

train = pd.read_csv('train.tsv', sep='\t')
test = pd.read_csv('test.tsv', sep='\t')

Check the number of rows and columns.


print(train.shape)
print(test.shape)

# (1482535, 8)
# (693359, 7)

Combine train and test data.


all_data = pd.concat([train, test])
all_data.head()


Check the basic information of the data.


# show_counts replaces the deprecated null_counts argument (pandas 1.2 and later)
all_data.info(show_counts=True)

'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175894 entries, 0 to 693358
Data columns (total 9 columns):
brand_name           1247687 non-null object
category_name        2166509 non-null object
item_condition_id    2175894 non-null int64
item_description     2175890 non-null object
name                 2175894 non-null object
price                1482535 non-null float64
shipping             2175894 non-null int64
test_id              693359 non-null float64
train_id             1482535 non-null float64
dtypes: float64(3), int64(2), object(4)
memory usage: 166.0+ MB
'''

Count the number of distinct values in each text column (duplicates are counted once).


print(all_data.brand_name.nunique())
print(all_data.category_name.nunique())
print(all_data.name.nunique())
print(all_data.item_description.nunique())

# 5289
# 1310
# 1750617
# 1862037

## 3. Preprocessing

Preprocess the data column by column.

Most of the columns contain text, so we vectorize them with a bag-of-words (BoW) model and TF-IDF.

Together with the label-encoded features, this produces far too many columns to hold as a dense array. Since almost all of the entries are 0, we store the features as sparse matrices (matrices that keep only the non-zero entries), which compresses them.
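
As a rough illustration of what the two vectorizers produce, here is a minimal sketch on three made-up product names (the toy strings and the `*_demo` names are assumptions for this example, not part of the Mercari data):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy product names, made up for illustration
docs = ["red leather bag", "blue leather wallet", "red red scarf"]

# Bag of words: one column per vocabulary word, values are raw counts
cv_demo = CountVectorizer()
bow = cv_demo.fit_transform(docs)
vocab = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

# TF-IDF: same columns, but words that appear in fewer documents get larger weights
tv_demo = TfidfVectorizer()
tfidf = tv_demo.fit_transform(docs)

print(vocab)                           # column order of both matrices
print(bow.toarray())                   # dense view, only sensible for tiny data
print(issparse(bow), issparse(tfidf))  # both results are scipy sparse matrices
print(bow.shape, bow.nnz)              # total cells vs. stored non-zero cells
```

Even in this tiny case only 8 of the 18 cells are non-zero. On the real data, with millions of rows and a very large vocabulary, the proportion of zeros is far higher, which is why the sparse format matters.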

# name

cv = CountVectorizer()
name = cv.fit_transform(all_data.name)

# item_description

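# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
all_data['item_description'] = all_data['item_description'].fillna('null')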

tv = TfidfVectorizer()
item_description = tv.fit_transform(all_data.item_description)

# category_name

all_data['category_name'] = all_data['category_name'].fillna('null')

lb = LabelBinarizer(sparse_output=True)
category_name = lb.fit_transform(all_data.category_name)

# brand_name

all_data['brand_name'] = all_data['brand_name'].fillna('null')

# fit_transform refits the binarizer, so reusing lb here is safe
brand_name = lb.fit_transform(all_data.brand_name)

# item_condition_id, shipping

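onehot_cols = ['item_condition_id', 'shipping']
# columns= forces one-hot encoding; integer columns are otherwise passed through unchanged
onehot_df = pd.get_dummies(all_data[onehot_cols], columns=onehot_cols, dtype=np.uint8)
onehot_data = csr_matrix(onehot_df.to_numpy())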
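
One detail worth knowing here: by default `pd.get_dummies` only encodes object, string, and category columns, so integer-coded columns such as these two are left as plain numbers unless they are named in `columns=`. A small sketch of the difference, using a made-up frame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with the same two integer-coded columns (values are made up)
demo = pd.DataFrame({"item_condition_id": [1, 3, 2, 1], "shipping": [0, 1, 1, 0]})

# Default behaviour: integer columns are passed through unchanged
passthrough = pd.get_dummies(demo)

# With columns=, each distinct value becomes its own indicator column
encoded = pd.get_dummies(demo, columns=["item_condition_id", "shipping"])

print(passthrough.columns.tolist())
print(encoded.columns.tolist())
```

Whether true one-hot columns or the raw integer codes work better for a linear model is something to check by experiment; the sketch only shows what each call returns.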

Finally, stack all of these feature blocks horizontally into a single sparse matrix.


X_sparse = hstack((name, item_description, category_name, brand_name, onehot_data)).tocsr()

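## 4. Creating a model

The combined all_data contains both train and test rows, but only the train rows have the objective variable (price). Trim X to the same number of rows as y, that is, the number of rows in the train data. Because train was placed first in pd.concat, these are the first rows of the matrix.
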
nrows = train.shape[0]
X = X_sparse[:nrows]
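
The stacking and row slicing can be pictured with a toy example. This is only a sketch: the two blocks, the 3/2 split, and the `*_demo` names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two toy feature blocks with 5 rows each; pretend rows 0-2 are train and 3-4 are test
block_a = csr_matrix(np.arange(10).reshape(5, 2))
block_b = csr_matrix(np.eye(5))

# hstack puts the blocks side by side; tocsr() gives a format that supports row slicing
stacked = hstack((block_a, block_b)).tocsr()

n_train_demo = 3                          # hypothetical number of train rows
X_demo = stacked[:n_train_demo]           # rows that have a target value
X_holdout_demo = stacked[n_train_demo:]   # rows without a target value

print(stacked.shape, X_demo.shape, X_holdout_demo.shape)
```

The slice is only valid because the row order of every block matches the order of the concatenated frame, with the train rows first.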

The prices in y vary over a wide range, and that spread affects the prediction results. Standardization would also work, but this time we apply a log transform.

In addition, the transform used is $\log(y + 1)$ so that there is no problem even if the value of y is 0.


y = np.log1p(train.price)
y[:5]

'''
0    2.397895
1    3.970292
2    2.397895
3    3.583519
4    3.806662
Name: price, dtype: float64
'''
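
To see the transform and its inverse in isolation, here is a minimal sketch on a few made-up prices (the array below is an assumption for illustration; 10 and 52 happen to reproduce the first two values in the output above):

```python
import numpy as np

# Made-up prices, including 0 to show why log(y + 1) is used instead of log(y)
prices = np.array([0.0, 10.0, 52.0, 2000.0])

log_prices = np.log1p(prices)    # log(y + 1); maps 0 to 0 instead of -inf
restored = np.expm1(log_prices)  # exp(z) - 1; the inverse transform

print(log_prices)
print(np.allclose(restored, prices))
```

`np.expm1` is what the evaluation step later uses to bring predictions back to the original price scale.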

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

ridge = Ridge()
ridge.fit(X_train, y_train)

'''
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
'''

## 5. Performance evaluation

y_pred = ridge.predict(X_test)

We evaluate with RMSLE (Root Mean Squared Logarithmic Error), the variant of RMSE that the competition uses as its metric.

y was log-transformed before modeling, so the transform has to be undone afterwards. That is done inside the evaluation function with np.expm1.


def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_test), np.expm1(y_pred)))

rmse(y_test, y_pred)

# 0.4745184301527575
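
Since the model is trained on log1p(price), RMSLE on the original price scale is the same quantity as plain RMSE on the log scale. A quick check of that relationship, using made-up numbers (both arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up true and predicted prices on the original scale
y_true_price = np.array([10.0, 52.0, 35.0, 44.0])
y_pred_price = np.array([12.0, 40.0, 30.0, 60.0])

# RMSLE computed on the original price scale
rmsle = np.sqrt(mean_squared_log_error(y_true_price, y_pred_price))

# Plain RMSE computed on the log1p scale
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_true_price), np.log1p(y_pred_price)))

print(np.isclose(rmsle, rmse_log))
```

One caveat: `mean_squared_log_error` rejects negative inputs, so if a model ever predicts a log value below 0, `np.expm1` of that prediction would need to be clipped to 0 before scoring.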

From the above, we were able to predict and evaluate prices from Mercari's product information.

I wrote this article with beginners in mind. If you found it helpful, I would appreciate it if you could give it an LGTM.

Thank you for reading.
