It can be hard to tell how much a product is really worth. A small difference in detail can make a big difference in price. For example, one of these sweaters costs $335 and the other $9.99. Can you guess which is which?
Given the sheer number of products sold online, pricing is even more difficult. Clothing prices follow strong seasonal trends and are heavily influenced by the brand name, while the prices of electronic devices fluctuate with their specifications.
Mercari, Japan's largest community-driven shopping app, is deeply aware of this issue. Because sellers can list almost anything on the Mercari marketplace, it is hard to suggest a good price to them.
The Mercari Price Suggestion Challenge is a competition to estimate a "reasonable price" for a product from the data of items actually listed for sale. The product data includes the product name, description, condition, brand name, category name, and so on, and machine learning is used to predict the selling price from these.
The dataset, from the North American version of Mercari, is publicly available, so anyone can download it: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
In this article, I use this data to estimate appropriate prices.
train.tsv contains roughly 1.5 million items that were actually listed. Since the data comes from the North American version of Mercari, everything is in English. Each product is described by 8 columns.
Column | Description |
---|---|
train_id | ID of the listing |
name | Product name |
item_condition_id | Product condition |
category_name | Product category |
brand_name | Brand name |
price | Selling price (in dollars) |
shipping | Whether the shipping fee is paid by the seller or the buyer |
item_description | Product description |
This data is split into train and test sets, and the selling price is predicted with machine learning.
The code runs on Google Colaboratory. Because the dataset is very large, processing takes a long time without a GPU environment.
For an overview of Google Colaboratory and how to set it up, see: Google Colaboratory overview and usage procedure (TensorFlow and GPU can be used).
RMSLE is used when the target follows a distribution close to a **lognormal distribution** and when you want to measure the error between the actual and predicted values **as a ratio** rather than as an absolute difference.
Looking at the figure above, the histogram of product prices looks like a lognormal distribution. Also, for example, the pairs (1000, 5000) and (100000, 104000) both have an absolute error of 4000, but their error ratios differ greatly.
For these reasons, RMSLE seems to be a suitable evaluation metric for price estimation.
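For reference, a minimal sketch of RMSLE on raw prices (the full script at the end of this article applies `np.log1p` to the prices first and then takes a plain RMSE, which is equivalent):

```python
import numpy as np

def rmsle_raw(y_true, y_pred):
    """Root mean squared logarithmic error on raw (untransformed) prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same absolute error of 4000, very different RMSLE:
print(rmsle_raw(np.array([1000.0]), np.array([5000.0])))      # ~1.61
print(rmsle_raw(np.array([100000.0]), np.array([104000.0])))  # ~0.04
```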
Not only train.tsv but also test.tsv is published. However, since test.tsv has no ground-truth labels, about 10,000 rows are set aside from train.tsv and used as test data instead.
Overall data (1482535, 8) -> train_df (1472535, 8), test_df (10000, 7)
There are many missing values in the category, brand, and product description columns. Handling missing values is a standard preprocessing step, so they are filled in with the function below. After filling, the brand value "missing" accounted for 42% of all rows.
def handle_missing_inplace(dataset):
    # Fill missing categories, brands, and descriptions with placeholder values
    dataset['category_name'].fillna(value="Other", inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='None', inplace=True)
Before the type conversion, the brands are trimmed. There are about 5,000 distinct brands, and brand names that appear only a very few times are not very useful for learning, so they are replaced with the same "missing" value used for blanks.
pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "missing"
After cutting away roughly half of the brands, the least frequent remaining brand appears 4 times.
The text columns are converted to the categorical type, because dummy variables are created from them in a later step.
def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')
CountVectorizer is applied to the product names and category names. Simply put, CountVectorizer vectorizes text according to word occurrence counts. For example, applying it to the three product names 'MLB Cincinnati Reds T Shirt Size XL', 'AVA-VIV Blouse', and 'Leather Horse Statues' vectorizes them as follows.
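A minimal sketch of that vectorization (the column order follows the fitted vocabulary, and single-character tokens such as 'T' are dropped by the default tokenizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ['MLB Cincinnati Reds T Shirt Size XL',
         'AVA-VIV Blouse',
         'Leather Horse Statues']

cv = CountVectorizer()
X = cv.fit_transform(names)   # sparse count matrix, shape (3, number of words)
print(cv.vocabulary_)         # word -> column index
print(X.toarray())            # occurrence counts for each product name
```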
In addition, since product names are typed in by sellers, they can contain typos, as well as words or numbers that appear in only one specific listing. With this in mind, the min_df option is added to CountVectorizer. min_df excludes words that appear in fewer than min_df documents (an integer is a document count; a float is interpreted as a fraction of documents).
count_name = CountVectorizer(min_df=NAME_MIN_DF)
X_name = count_name.fit_transform(df["name"])
count_category = CountVectorizer()
X_category = count_category.fit_transform(df["category_name"])
Unlike CountVectorizer, TfidfVectorizer considers not only how often a word occurs but also how rare it is. Filler words that appear in almost every sentence, such as the Japanese "desu" and "masu" or English articles like "a" and "the", occur very frequently, and CountVectorizer is strongly pulled by them. TfidfVectorizer is used instead when you want to vectorize text with a focus on word importance.
In other words, TfidfVectorizer gives high importance to words that appear frequently in one document but rarely in other documents.
For these reasons, the product descriptions are vectorized with TfidfVectorizer.
Looking at the resulting table above, articles and conjunctions still have strong tf-idf values. Since such words carry no meaning for learning, stop_words='english' is specified to remove them.
Next, the figure on the left shows the 10 lowest tf-idf values. Terms with extremely small tf-idf values contribute little, so they are dropped. Also, instead of computing tf-idf for single words only, tf-idf is taken over sequences of consecutive words (n-grams). For example, let's look at the n-grams of the saying "an apple a day keeps the doctor away".
n-gram (1, 2)
{'an': 0, 'apple': 2, 'day': 5, 'keeps': 9, 'the': 11, 'doctor': 7, 'away': 4,
 'an apple': 1, 'apple day': 3, 'day keeps': 6, 'keeps the': 10,
 'the doctor': 12, 'doctor away': 8}
n-gram (1, 3)
{'an': 0, 'apple': 3, 'day': 7, 'keeps': 12, 'the': 15, 'doctor': 10, 'away': 6,
 'an apple': 1, 'apple day': 4, 'day keeps': 8, 'keeps the': 13, 'the doctor': 16,
 'doctor away': 11, 'an apple day': 2, 'apple day keeps': 5, 'day keeps the': 9,
 'keeps the doctor': 14, 'the doctor away': 17}
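These vocabularies can be reproduced with the `ngram_range` option, for example as below (the single-character word "a" is dropped by the default tokenizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["an apple a day keeps the doctor away"]

for ngram_range in [(1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram_range)
    vec.fit(text)
    # vocabulary_ maps each uni-/bi-/tri-gram to its column index
    print(ngram_range, vec.vocabulary_)
```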
In this way, widening the n-gram range captures the characteristics of the text in more detail and yields more useful features. With these options added, the result looks like the figure on the right.
The final tf-idf output looks like the figure below. The term with the highest tf-idf value is "description", which is clearly driven by the default text "No description yet". You can also see that condition words such as "new" and "used" affect the price.
tfidf_descp = TfidfVectorizer(max_features = MAX_FEAT_DESCP,
ngram_range = (1,3),
stop_words = "english")
X_descp = tfidf_descp.fit_transform(df["item_description"])
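To see which terms end up with the largest and smallest weights (as in the figures discussed above), one rough approach is to sum each term's tf-idf score over all descriptions; a sketch using the `tfidf_descp` and `X_descp` objects from the block above:

```python
import numpy as np
import pandas as pd

# Sum each term's tf-idf score over all item descriptions
scores = np.asarray(X_descp.sum(axis=0)).ravel()
terms = sorted(tfidf_descp.vocabulary_, key=tfidf_descp.vocabulary_.get)
tfidf_totals = pd.Series(scores, index=terms).sort_values(ascending=False)

print(tfidf_totals.head(10))  # highest-weighted terms ('description', 'new', ...)
print(tfidf_totals.tail(10))  # lowest-weighted terms
```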
As mentioned earlier, there are about 5,000 distinct brands, and after the trimming step about 2,500 remain.
These brand names are one-hot encoded into 0/1 vectors with LabelBinarizer. Since the data is large, sparse_output=True is set before running it.
label_brand = LabelBinarizer(sparse_output=True)
X_brand = label_brand.fit_transform(df["brand_name"])
Dummy variables are a technique for converting non-numeric data into numbers, specifically into columns containing only 0s and 1s. Here, dummy variables are created for the product condition and shipping columns.
X_dummies = scipy.sparse.csr_matrix(pd.get_dummies(df[[
"item_condition_id", "shipping"]], sparse = True).values, dtype=int)
Now that we have processed all the columns, we will combine all the arrays and put them in the model.
X = scipy.sparse.hstack((X_dummies,
X_descp,
X_brand,
X_category,
X_name)).tocsr()
Since not every parameter can be explained, the main ones are summarized briefly.
Option | Description |
---|---|
alpha | Strength of the regularization that prevents overfitting |
max_iter | Maximum number of solver iterations |
tol | Tolerance of the stopping criterion (training continues only while the score improves by at least tol) |
alpha
A model can be fit so closely to the given training data that its error on that data is tiny, yet it fails to predict unknown data well; this is called **overfitting**. Overfitting can be prevented by putting constraints on the learned parameters, and such a constraint is called **regularization**.
Option | Description |
---|---|
n_estimators | Number of decision trees |
learning_rate | Weight (shrinkage) applied to each tree's contribution |
max_depth | Maximum depth of each tree |
num_leaves | Number of leaves per tree |
min_child_samples | Minimum number of samples in a leaf node |
n_jobs | Number of parallel jobs |
learning_rate
- Generally increases accuracy, but makes overfitting easier.
- If it is too small, the computational load grows and training takes a long time.
n_estimators
- The most important parameter in a random forest.
Ridge
First, search for the optimal value of alpha. Alpha is varied over the range 0.05 to 75 to visualize its effect on accuracy.
From the figure, the minimum, RMSLE 0.4745938085035464, was obtained at alpha = 3.0.
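A rough sketch of such a sweep, assuming the `X_train`, `y_train`, `X_test`, `target`, and `rmsle` objects defined in the full script at the end of this article (the alpha grid here is only illustrative):

```python
from sklearn.linear_model import Ridge

# Try a handful of alpha values between 0.05 and 75 and compare dev-set RMSLE
for alpha in [0.05, 0.5, 1, 3, 10, 30, 75]:
    ridge = Ridge(solver='auto', fit_intercept=True, alpha=alpha)
    ridge.fit(X_train, y_train)
    preds = ridge.predict(X_test)
    print(f"alpha={alpha}: RMSLE={rmsle(target, preds):.4f}")
```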
Next, searching the maximum number of iterations max_iter over a wide range brought no improvement in accuracy, and the larger the tol value, the lower the accuracy became.
Based on the above, the Ridge model is built with alpha = 3.
LGBM
For tuning the LGBM parameters, I referred to the official documentation: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
A standard first step in tuning LGBM seems to be setting learning_rate and n_estimators. To improve accuracy, learning_rate should be small and n_estimators large. learning_rate is varied over the range 0.05 to 0.7 while adjusting n_estimators.
Next, with learning_rate and n_estimators fixed, num_leaves is varied.
- num_leaves = 20: RMSLE 0.4620242411418184
- num_leaves = 31: RMSLE 0.4569169142862856
- num_leaves = 40: RMSLE 0.45587232757584967
Increasing num_leaves improves accuracy overall; "overall" here means the trend holds even when the other parameters are adjusted.
However, while adjusting the parameters, raising num_leaves too far caused **overfitting**, and in some cases a good score could not be obtained, so it had to be balanced carefully against the other parameters.
For example, with learning_rate = 0.7, max_depth = 15, num_leaves = 30: RMSLE 44.650714399639845.
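A sketch of how such a num_leaves sweep could look, assuming the `train_X`/`train_y` split, `X_test`, `target`, and `rmsle` from the full script below; the other parameter values are simply the article's final settings:

```python
from lightgbm import LGBMRegressor

for num_leaves in [20, 31, 40]:
    params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
              'num_leaves': num_leaves, 'subsample': 0.9, 'colsample_bytree': 0.8,
              'min_child_samples': 50, 'n_jobs': 4}
    model = LGBMRegressor(**params)
    model.fit(train_X, train_y)
    print(f"num_leaves={num_leaves}: RMSLE={rmsle(target, model.predict(X_test)):.4f}")
```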
The final LGBM model looks like this:
lgbm_params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
'num_leaves': 40, 'subsample': 0.9, 'colsample_bytree': 0.8,
'min_child_samples': 50, 'n_jobs': 4}
Ridge + LGBM is used to compute the final predictions. LGBM scores better than Ridge on its own, but combining the two models improves accuracy further.
Ridge RMSL error on dev set: 0.47459370995217937
LGBM RMSL error on dev set: 0.45317097672035855
Ridge + LGBM RMSL error on dev set: 0.4433081424824549
With this accuracy, a $30 item would be predicted in a range of roughly $18.89 to $47.29 (exp(ln(31) ± 0.4433) − 1).
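A quick check of that range:

```python
import numpy as np

rmsle_value = 0.4433   # Ridge + LGBM dev-set RMSLE
price = 30.0
low = np.expm1(np.log1p(price) - rmsle_value)    # ~18.89
high = np.expm1(np.log1p(price) + rmsle_value)   # ~47.29
print(low, high)
```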
Here price is the Ridge + LGBM prediction and real_price is the actual price. For about 7,553 of the 10,000 test items, the error was less than $10.
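That count can be reproduced from the `price` and `real_price` columns added to `test_df` at the end of the full script, for example:

```python
# Number of test items whose predicted price is within $10 of the actual price
within_10 = (test_df['price'] - test_df['real_price']).abs() < 10
print(within_10.sum(), "of", len(test_df))
```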
(Figure) Residual plot on a log scale
(Figure) Distribution of actual and predicted prices
Although I simply took the difference, there are about 90 products where the predicted and actual prices differ by $100 or more. Since this dataset is from two years ago, relatively new products such as the Apple Watch have few data points and cannot be predicted well. Also, this being Mercari, sellers price items according to their personal sense of value, so not everything can be predicted accurately. That Coach bag actually sold for about $9...
import numpy as np
import pandas as pd
import scipy
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
NUM_BRANDS = 2500
NAME_MIN_DF = 10
MAX_FEAT_DESCP = 10000
print("Reading in Data")
df = pd.read_csv('train.tsv', sep='\t')
print('Formatting Data')
shape = df.shape[0]
train_df = df[:shape-10000]
test_df = df[shape-10000:]
target = test_df.loc[:, 'price'].values
target = np.log1p(target)
print("Concatenate data")
df = pd.concat([train_df, test_df], axis=0)
nrow_train = train_df.shape[0]
y_train = np.log1p(train_df["price"])
def handle_missing_inplace(dataset):
    dataset['category_name'].fillna(value="Other", inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='None', inplace=True)
print('Handle missing')
handle_missing_inplace(df)
def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')
print('Convert categorical')
to_categorical(df)
print('Cut')
pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "missing"
print("Name Encoders")
count_name = CountVectorizer(min_df=NAME_MIN_DF)
X_name = count_name.fit_transform(df["name"])
print("Category Encoders")
count_category = CountVectorizer()
X_category = count_category.fit_transform(df["category_name"])
print("Descp encoders")
tfidf_descp = TfidfVectorizer(max_features = MAX_FEAT_DESCP,
ngram_range = (1,3),
stop_words = "english")
X_descp = tfidf_descp.fit_transform(df["item_description"])
print("Brand encoders")
label_brand = LabelBinarizer(sparse_output=True)
X_brand = label_brand.fit_transform(df["brand_name"])
print("Dummy Encoders")
X_dummies = scipy.sparse.csr_matrix(pd.get_dummies(df[[
"item_condition_id", "shipping"]], sparse = True).values, dtype=int)
X = scipy.sparse.hstack((X_dummies,
X_descp,
X_brand,
X_category,
X_name)).tocsr()
print("Finished to create sparse merge")
X_train = X[:nrow_train]
X_test = X[nrow_train:]
model = Ridge(solver='auto', fit_intercept=True, alpha=3)
print("Fitting Rige")
model.fit(X_train, y_train)
print("Predicting price Ridge")
preds1 = model.predict(X_test)
def rmsle(Y, Y_pred):
    assert Y.shape == Y_pred.shape
    return np.sqrt(np.mean(np.square(Y_pred - Y)))
print("Ridge RMSL error on dev set:", rmsle(target, preds1))
def rmsle_lgb(labels, preds):
    return 'rmsle', rmsle(preds, labels), False
train_X, valid_X, train_y, valid_y = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
lgbm_params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
'num_leaves': 40, 'subsample': 0.9, 'colsample_bytree': 0.8,
'min_child_samples': 50, 'n_jobs': 4}
model = LGBMRegressor(**lgbm_params)
print('Fitting LGBM')
model.fit(train_X, train_y,
eval_set=[(valid_X, valid_y)],
eval_metric=rmsle_lgb,
early_stopping_rounds=100,
verbose=True)
print("Predict price LGBM")
preds2 = model.predict(X_test)
print("LGBM RMSL error on dev set:", rmsle(target, preds2))
preds = (preds1 + preds2) / 2
print("Ridge + LGBM RMSL error on dev set:", rmsle(target, preds))
test_df["price1"] = np.expm1(preds1)
test_df['price2'] = np.expm1(preds2)
test_df['price']= np.expm1(preds)
test_df['real_price'] = np.expm1(target)
Estimating fair prices in this way gave a better score than I expected. I think accuracy could have improved a bit more by tuning min_df and the n-gram range during preprocessing, and by cleaning the text more carefully rather than just feeding it to tf-idf. Also, each person values products differently, so prices can only ever be predicted to a certain extent. If you found this helpful, please give it a like!