[PYTHON] Reasonable price estimation of Mercari by machine learning

Introduction

It can be hard to know how much something is really worth. Small details can make a big difference in price. For example, one of these sweaters is $335 and the other is $9.99. Can you guess which is which?

(Figure: two similar-looking sweaters, one priced at $335 and the other at $9.99)

Pricing becomes even harder at the scale of products sold online. Clothing prices follow strong seasonal trends and are heavily influenced by the brand name, while the prices of electronic devices fluctuate based on their specifications.

Mercari, Japan's largest community-driven shopping app, is deeply aware of this problem. Because sellers can list just about anything on the Mercari marketplace, it is difficult to suggest a good price to them.

About the Mercari Price Suggestion Challenge

(Figure: Mercari Price Suggestion Challenge)

The Mercari Price Suggestion Challenge is a competition to estimate a "reasonable price" for a product from data on items that were actually listed for sale. The product data includes the product name, description, condition, brand name, category name, and so on, and machine learning is used to predict the selling price from these.

The product dataset is published by the North American version of Mercari, so anyone can download it: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data

This time, I would like to use this data to estimate the appropriate price.

Data type

(Figure: sample rows from train.tsv)

train.tsv contains data on about 1.5 million items that were actually listed. Because this is data from the North American version of Mercari, everything is written in English. Each product is described by 8 columns.

column             description
train_id           ID of the listing
name               product name
item_condition_id  condition of the product
category_name      product category
brand_name         brand name
price              selling price (in dollars)
shipping           who pays the shipping fee (seller or buyer)
item_description   product description

These data are divided into train and test, and the selling price is predicted by machine learning.
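As a quick first check, the sketch below loads train.tsv with pandas and prints its shape and columns. It assumes train.tsv has been downloaded from the Kaggle page above into the working directory; the read_csv call mirrors the one in the completed code at the end of this article.

import pandas as pd

# Assumes train.tsv from the Kaggle competition page is in the working directory.
df = pd.read_csv('train.tsv', sep='\t')

print(df.shape)               # roughly (1482535, 8)
print(df.columns.tolist())    # the 8 columns listed above
print(df.head(3))             # peek at the first few listings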

System configuration

  1. Data acquisition
    1. train.tsv (data file)
  2. Data preprocessing
    1. Missing-value handling and type conversion
    2. Count-based vectorization of product names and category names
    3. Feature extraction from product descriptions
    4. Brand name labeling
    5. Converting product condition and shipping into quantitative variables
  3. Model building
    1. Hyperparameter optimization
    2. Ridge + LightGBM
  4. Model evaluation
    1. Data frame
    2. Visualization
    3. Overall evaluation

The code runs on Google Colaboratory. Because the amount of data is extremely large, processing takes a long time unless a GPU environment is used.

For Google Colaboratory, please refer to: Google Colaboratory overview and usage procedure (TensorFlow and GPU can be used)

Accuracy evaluation method

(Figure: histogram of product prices, which looks close to a lognormal distribution)

RMSLE is used when the values follow a distribution close to a lognormal distribution, and when you want to evaluate the error between the actual and predicted values as a rate or ratio rather than as an absolute width.

Looking at the figure above, the histogram of product prices does look like a lognormal distribution. Also, for example, the pairs (1000, 5000) and (100000, 104000) both have an absolute error of 4000, but their error ratios are very different, and that difference matters.

For these reasons, RMSLE seems well suited as the evaluation metric for price estimation.
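As a reference, here is a minimal sketch of the metric computed on raw (untransformed) prices; the completed code at the end instead applies np.log1p to the targets first and then takes a plain RMSE, which amounts to the same thing.

import numpy as np

def rmsle_raw(y_true, y_pred):
    # RMSLE on raw prices: sqrt(mean((log1p(pred) - log1p(true))^2))
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Both pairs have an absolute error of 4000, but very different RMSLE contributions.
print(rmsle_raw(np.array([1000.0]), np.array([5000.0])))
print(rmsle_raw(np.array([100000.0]), np.array([104000.0])))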

Data preprocessing

test.tsv is also published along with train.tsv, but since it has no ground-truth labels, about 10,000 items are split off from train.tsv and used as the test data instead.

Overall data (1482535, 8) -> train_df (1472535, 8), test_df (10000, 7)
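The split simply takes the last 10,000 rows of train.tsv as the held-out test set, as in the completed code at the end; the sketch below shows just that step, assuming the dataframe df loaded above.

# Hold out the last 10,000 listings of train.tsv as test data.
n_test = 10000
train_df = df[:-n_test]
test_df = df[-n_test:]
print(train_df.shape, test_df.shape)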

Missing and type conversion

There are many blanks in the category, brand, and product description columns. In machine learning it is standard to handle missing values, so the blanks are filled in with the following function. After filling, the "missing" brand accounted for 42% of the total.

def handle_missing_inplace(dataset):
    dataset['category_name'].fillna(value="Other", inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='None', inplace=True)

The brands are cut down before the type conversion. There are about 5,000 brands, and brand names that appear only a handful of times are not very useful for learning, so they are replaced with the same "missing" value as the blanks.

pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "missing"

After cutting the brands down to about half, the minimum number of appearances among the remaining brands was 4.

(Figure: brand_name value counts after the cut)
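A quick way to check this, assuming df has already been processed by the cut above (the "missing" bucket is dropped because it is just the placeholder value):

brand_counts = df["brand_name"].value_counts()
# Minimum appearance count among the brands that were kept, excluding the "missing" placeholder.
print(brand_counts.drop("missing").min())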

The text data is converted to the categorical type. This is because dummy variables are created from it in later processing.

def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')

Text feature extraction with CountVectorizer

CountVectorizer is applied to the product names and category names. Put simply, CountVectorizer vectorizes text based on word occurrence counts. For example, applying CountVectorizer to the three product names 'MLB Cincinnati Reds T Shirt Size XL', 'AVA-VIV Blouse', and 'Leather Horse Statues' vectorizes them as follows.

(Figure: count-vectorized representation of the three product names)
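The figure above can be reproduced with a small sketch like the following; the DataFrame wrapping is just for display.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

names = ['MLB Cincinnati Reds T Shirt Size XL', 'AVA-VIV Blouse', 'Leather Horse Statues']
cv = CountVectorizer()
X = cv.fit_transform(names)

# One row per product name, one column per word, values are occurrence counts.
# (Use get_feature_names() instead on older scikit-learn versions.)
print(pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out(), index=names))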

In addition, because product names are entered by sellers, there may be typos or words and numbers that appear in only a few specific listings. With this in mind, the min_df option is added to CountVectorizer. min_df excludes words that appear in too few documents: an integer is a minimum document count, while a float between 0 and 1 is a minimum document proportion. Here NAME_MIN_DF = 10 is used, so words appearing in fewer than 10 product names are dropped.

count_name = CountVectorizer(min_df=NAME_MIN_DF)
X_name = count_name.fit_transform(df["name"])

count_category = CountVectorizer()
X_category = count_category.fit_transform(df["category_name"])

Text feature extraction with TfidfVectorizer

Unlike CountVectorizer, TfidfVectorizer considers not only how often a word occurs but also how rare it is. For example, words that appear in almost every sentence, such as the Japanese sentence endings "desu" and "masu" or English articles like "a" and "the", occur very frequently, and CountVectorizer is strongly pulled toward such words. TfidfVectorizer is used instead when you want to vectorize text with the importance of each word in mind.

In other words, TfidfVectorizer gives high importance to words that appear frequently in one document but infrequently in other documents.

For these reasons, the product description is vectorized with TfidfVectorizer.

(Figure: tf-idf values before adding options; articles and conjunctions rank highly)

As the table above shows, articles and conjunctions end up with strong tf-idf values. Since such words still carry no meaning for learning, stop_words='english' is specified.

Next, the figure on the left shows the bottom 10 tf-idf values. Words with extremely small tf-idf values do not carry much meaning, so they are removed. Also, instead of computing tf-idf for single words only, tf-idf is computed over sequences of consecutive words (n-grams). For example, let's build n-grams from the saying "an apple a day keeps the doctor away".

n-gram(1, 2)

{'an': 0, 'apple': 2, 'day': 5, 'keeps': 9, 'the': 11, 'doctor': 7, 'away': 4,
 'an apple': 1, 'apple day': 3, 'day keeps': 6, 'keeps the': 10,
 'the doctor': 12, 'doctor away': 8}

n-gram(1, 3)

{'an': 0, 'apple': 3, 'day': 7, 'keeps': 12, 'the': 15, 'doctor': 10, 'away': 6,
 'an apple': 1, 'apple day': 4, 'day keeps': 8, 'keeps the': 13, 'the doctor': 16,
 'doctor away': 11, 'an apple day': 2, 'apple day keeps': 5, 'day keeps the': 9,
 'keeps the doctor': 14, 'the doctor away': 17}

In this way, as the n-gram range grows, the characteristics of the text are captured in more detail and more useful features are obtained. With these options added, the result looks like the figure on the right.

(Figure: left, the bottom 10 tf-idf terms; right, tf-idf terms after adding the options)
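The n-gram vocabularies listed above can be reproduced with a small sketch like this; vocabulary_ maps each term to its column index, and the single-letter "a" is dropped by the default tokenizer.

from sklearn.feature_extraction.text import TfidfVectorizer

sentence = ["an apple a day keeps the doctor away"]

for ngram in [(1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram)
    vec.fit(sentence)
    # vocabulary_ maps each unigram/bigram/trigram to its column index.
    print(ngram, vec.vocabulary_)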

The final tf-idf result looks like the figure below. The term with the highest tf-idf value is "description", which clearly comes from the default text "No description yet". You can also see that terms indicating new or used condition, such as "new" and "used", affect the price.

(Figure: terms with the highest tf-idf values in the product descriptions)

tfidf_descp = TfidfVectorizer(max_features = MAX_FEAT_DESCP,
                              ngram_range = (1,3),
                              stop_words = "english")
X_descp = tfidf_descp.fit_transform(df["item_description"])

Binarization with LabelBinarizer

As mentioned earlier, there were originally about 5,000 brands, and after the cut there are about 2,500. These are encoded as 0/1 indicator columns. Since there is a lot of data, sparse_output=True is set before running.

label_brand = LabelBinarizer(sparse_output=True)
X_brand = label_brand.fit_transform(df["brand_name"])

Dummy variables

Dummy variable encoding is a technique for converting non-numeric data into numbers, specifically into columns containing only 0s and 1s. Here, dummy variables are created for the product condition and the shipping flag.

X_dummies = scipy.sparse.csr_matrix(pd.get_dummies(df[[
    "item_condition_id", "shipping"]], sparse = True).values, dtype=int)

Now that we have processed all the columns, we will combine all the arrays and put them in the model.

X = scipy.sparse.hstack((X_dummies,
                         X_descp,
                         X_brand,
                         X_category,
                         X_name)).tocsr()

Model learning

Parameter description

Since all parameters cannot be explained, some parameters are briefly summarized.

Ridge parameters

option    description
alpha     strength of the regularization that prevents overfitting
max_iter  maximum number of solver iterations
tol       solver tolerance; iteration stops once the improvement falls below tol

alpha

It is possible to build a model that fits the given training data very closely and has a small error on it, yet fails to make proper predictions on unknown data; this is called "overfitting". Overfitting can be prevented by putting constraints on the learned parameters, and such a constraint is called "regularization".

LightGBM parameters

option             description
n_estimators       number of decision trees
learning_rate      weight given to each tree
max_depth          maximum depth of each tree
num_leaves         number of leaves per tree
min_child_samples  minimum number of samples in a leaf node
n_jobs             number of parallel processes

learning_rate

- Generally, accuracy improves, but overfitting becomes more likely.
- If it is too small, the computational load becomes large and processing takes time.

n_estimators

- One of the most important parameters in random forests.

Hyperparameter optimization

Ridge

First, search for the optimal value of alpha. alpha is varied in the range 0.05 to 75 to visualize its effect on accuracy.

(Figure: RMSLE as a function of alpha)

From the figure, the minimum RMSLE 0.4745938085035464 was obtained at alpha = 3.0.
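A minimal sketch of such a sweep is shown below. It assumes the feature matrix X_train and log-transformed target y_train built in the preprocessing section, and holds out part of the training data for validation; the exact alpha grid is only illustrative.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def rmsle(y, y_pred):
    # y and y_pred are already log1p-transformed, so this is a plain RMSE in log space.
    return np.sqrt(np.mean(np.square(y_pred - y)))

tr_X, va_X, tr_y, va_y = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

for alpha in [0.05, 0.5, 1, 3, 10, 30, 75]:   # illustrative grid over the 0.05-75 range
    model = Ridge(alpha=alpha)
    model.fit(tr_X, tr_y)
    print(alpha, rmsle(va_y, model.predict(va_X)))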

Next, varying the maximum number of iterations max_iter over its whole range brought no improvement in accuracy. Also, the larger the tol value, the worse the accuracy became.

(Figure: effect of max_iter and tol on RMSLE)

Based on the above, the Ridge model is built with alpha = 3.

LGBM

For the parameter tuning of LGBM, I referred to the documentation: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

The standard practice for tuning LGBM seems to be to start with learning_rate and n_estimators. To improve accuracy, learning_rate should be small and n_estimators should be large. learning_rate is varied in the range 0.05 to 0.7 while adjusting n_estimators.

Next, with learning_rate and n_estimators fixed, num_leaves is varied.
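A minimal sketch of a num_leaves sweep, reusing the held-out validation split and the rmsle helper from the Ridge sweep sketch above; the fixed learning_rate and n_estimators values here are illustrative.

from lightgbm import LGBMRegressor

for num_leaves in [20, 31, 40]:
    model = LGBMRegressor(n_estimators=1000, learning_rate=0.4,
                          num_leaves=num_leaves, n_jobs=4)
    model.fit(tr_X, tr_y)
    # rmsle is plain RMSE in log space, as defined in the Ridge sweep above.
    print(num_leaves, rmsle(va_y, model.predict(va_X)))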

(num_leaves = 20) RMSLE 0.4620242411418184
↓
(num_leaves = 31) RMSLE 0.4569169142862856
↓
(num_leaves = 40) RMSLE 0.45587232757584967

Increasing num_leaves improves accuracy overall. "Overall" here means even when the other parameters are also adjusted.

However, while tuning each parameter, raising num_leaves too far caused overfitting, and in some cases a good score could not be obtained. It had to be balanced against the other parameters.

With learning_rate = 0.7, max_depth = 15, num_leaves = 30: RMSLE 44.650714399639845

The final LGBM model looks like this:

lgbm_params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
               'num_leaves': 40, 'subsample': 0.9, 'colsample_bytree': 0.8,
               'min_child_samples': 50, 'n_jobs': 4}

Model evaluation

Ridge + LGBM is used to calculate the predicted values. LGBM scores better than Ridge on its own, but averaging the predictions of the two models improves accuracy further.

Ridge RMSL error on dev set: 0.47459370995217937

LGBM RMSL error on dev set: 0.45317097672035855

Ridge + LGBM RMSL error on dev set: 0.4433081424824549

With this accuracy, a $30 item has an estimated error range of roughly $18.89 to $47.29.
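That range follows directly from the log1p transform used for the target: an RMSLE of about 0.443 corresponds to shifting log1p(price) up or down by that amount, as the small check below shows.

import numpy as np

rmsle_score = 0.4433          # Ridge + LGBM score on the dev set
price = 30.0

lower = np.expm1(np.log1p(price) - rmsle_score)
upper = np.expm1(np.log1p(price) + rmsle_score)
print(lower, upper)           # roughly 18.9 and 47.3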

price is the value predicted by Ridge + LGBM, and real_price is the actual value. For about 7,553 of the 10,000 test items, the error was less than $10.

(Figure: test_df with predicted price and real_price columns)
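That count can be checked with a one-liner, assuming the price and real_price columns added to test_df at the end of the completed code below:

# Number of test items whose prediction is within $10 of the actual price.
print((abs(test_df["price"] - test_df["real_price"]) < 10).sum())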


(Figure: residual plot on a log scale)


(Figure: distribution of actual and predicted prices)


Simply taking the difference between predicted and actual values, about 90 products are off by $100 or more. Since this dataset is about two years old, items like the Apple Watch were still relatively new, so there is little data on them and they cannot be predicted well. Also, this being Mercari, prices are often set according to personal values, so not everything can be predicted well. That Coach bag actually sold for about $9...

(Figure: listings with prediction errors of $100 or more)

Completed code

import numpy as np
import pandas as pd
import scipy

from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

NUM_BRANDS = 2500
NAME_MIN_DF = 10
MAX_FEAT_DESCP = 10000

print("Reading in Data")
df = pd.read_csv('train.tsv', sep='\t')

print('Formatting Data')
shape = df.shape[0]
train_df = df[:shape-10000]
test_df = df[shape-10000:]

target = test_df.loc[:, 'price'].values
target = np.log1p(target)

print("Concatenate data")
df = pd.concat([train_df, test_df], axis=0)

nrow_train = train_df.shape[0]
y_train = np.log1p(train_df["price"])

def handle_missing_inplace(dataset):
    dataset['category_name'].fillna(value="Other", inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='None', inplace=True)

print('Handle missing')
handle_missing_inplace(df)

def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')

print('Convert categorical')
to_categorical(df)

print('Cut')
pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "missing"

print("Name Encoders")
count_name = CountVectorizer(min_df=NAME_MIN_DF)
X_name = count_name.fit_transform(df["name"])

print("Category Encoders")
count_category = CountVectorizer()
X_category = count_category.fit_transform(df["category_name"])

print("Descp encoders")
tfidf_descp = TfidfVectorizer(max_features = MAX_FEAT_DESCP,
                              ngram_range = (1,3),
                              stop_words = "english")
X_descp = tfidf_descp.fit_transform(df["item_description"])

print("Brand encoders")
label_brand = LabelBinarizer(sparse_output=True)
X_brand = label_brand.fit_transform(df["brand_name"])

print("Dummy Encoders")
X_dummies = scipy.sparse.csr_matrix(pd.get_dummies(df[[
    "item_condition_id", "shipping"]], sparse = True).values, dtype=int)

X = scipy.sparse.hstack((X_dummies,
                         X_descp,
                         X_brand,
                         X_category,
                         X_name)).tocsr()

print("Finished to create sparse merge")

X_train = X[:nrow_train]
X_test = X[nrow_train:]

model = Ridge(solver='auto', fit_intercept=True, alpha=3)

print("Fitting Rige")
model.fit(X_train, y_train)

print("Predicting price Ridge")
preds1 = model.predict(X_test)

def rmsle(Y, Y_pred):
    assert Y.shape == Y_pred.shape
    return np.sqrt(np.mean(np.square(Y_pred - Y )))

print("Ridge RMSL error on dev set:", rmsle(target, preds1))

def rmsle_lgb(labels, preds):
    return 'rmsle', rmsle(preds, labels), False

train_X, valid_X, train_y, valid_y = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

lgbm_params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
               'num_leaves': 40, 'subsample': 0.9, 'colsample_bytree': 0.8,
               'min_child_samples': 50, 'n_jobs': 4}

model = LGBMRegressor(**lgbm_params)
print('Fitting LGBM')
model.fit(train_X, train_y,
          eval_set=[(valid_X, valid_y)],
          eval_metric=rmsle_lgb,
          early_stopping_rounds=100,
          verbose=True)

print("Predict price LGBM")
preds2 = model.predict(X_test)

print("LGBM RMSL error on dev set:", rmsle(target, preds2))

preds = (preds1 + preds2) / 2

print("Ridge + LGBM RMSL error on dev set:", rmsle(target, preds))

test_df["price1"] = np.expm1(preds1)
test_df['price2']=np.exp(preds2)
test_df['price']= np.expm1(preds)
test_df['real_price'] = np.expm1(target)

Summary

As a result of estimating fair prices, the score was better than expected. I think the accuracy could have been improved a little more by changing the min_df value and the n-gram range settings in the preprocessing, and by cleaning the text more carefully instead of just feeding it to tf-idf. Also, since the value of a product differs from person to person, predictions can only go so far. If you found this helpful, please give it a like!

Recommended Posts

Reasonable price estimation of Mercari by machine learning
Judgment of igneous rock by machine learning ②
Classification of guitar images by machine learning Part 1
Stock price forecast by machine learning Numerai Signals
[Translation] scikit-learn 0.18 Tutorial Introduction of machine learning by scikit-learn
Classification of guitar images by machine learning Part 2
Importance of machine learning datasets
4 [/] Four Arithmetic by Machine Learning
Predict the presence or absence of infidelity by machine learning
Basic understanding of depth estimation by mono camera (Deep Learning)
Stock price forecast by machine learning Let's get started Numerai
Significance of machine learning and mini-batch learning
Machine learning summary by Python beginners
Machine learning ③ Summary of decision tree
Stock price forecast by machine learning is so true Numerai Signals
A memorandum of scraping & machine learning [development technique] by Python (Chapter 4)
A memorandum of scraping & machine learning [development technique] by Python (Chapter 5)
Machine learning algorithm (generalization of linear regression)
Stock price forecast using machine learning (scikit-learn)
Making Sandwichman's Tale by Machine Learning ver4
[Learning memo] Basics of class by python
[Failure] Find Maki Horikita by machine learning
Four arithmetic operations by machine learning 6 [Commercial]
Machine learning
Machine learning algorithm (implementation of multi-class classification)
[Machine learning] Supervised learning using kernel density estimation
Python learning memo for machine learning by Chainer Chapter 13 Basics of neural networks
Stock price forecast using machine learning (regression)
[Machine learning] List of frequently used packages
Python & Machine Learning Study Memo ④: Machine Learning by Backpropagation
Python learning memo for machine learning by Chainer until the end of Chapter 2
Judge the authenticity of posted articles by machine learning (Google Prediction API).
Machine Learning: Image Recognition of MNIST by using PCA and Gaussian Native Bayes
Is it possible to eat stock price forecasts by machine learning [Implementation plan]
Implementation of a model that predicts the exchange rate (dollar-yen rate) by machine learning
Predict short-lived works of Weekly Shonen Jump by machine learning (Part 1: Data analysis)
Machine learning memo of a fledgling engineer Part 1
Beginning of machine learning (recommended teaching materials / information)
Try to forecast power demand by machine learning
Python & Machine Learning Study Memo ⑤: Classification of irises
Numerai Tournament-Fusion of Traditional Quants and Machine Learning-
Python & Machine Learning Study Memo ②: Introduction of Library
Full disclosure of methods used in machine learning
[Machine learning] Supervised learning using kernel density estimation Part 2
Basic understanding of stereo depth estimation (Deep Learning)
[Machine learning] Supervised learning using kernel density estimation Part 3
List of links that machine learning beginners are learning
Parallel learning of deep learning by Keras and Kubernetes
Overview of machine learning techniques learned from scikit-learn
About the development contents of machine learning (Example)
Summary of evaluation functions used in machine learning
Classify machine learning related information by topic model
Improvement of performance metrix by two-step learning model
Machine learning memo of a fledgling engineer Part 2
Get a glimpse of machine learning in Python
Python & Machine Learning Study Memo ⑦: Stock Price Forecast
Deep learning learned by implementation (segmentation) ~ Implementation of SegNet ~
Try using Jupyter Notebook of Azure Machine Learning
A story about data analysis by machine learning
Arrangement of self-mentioned things related to machine learning
Causal reasoning using machine learning (organization of causal reasoning methods)