It can be hard to tell how much a product is really worth. A small difference in detail can make a big difference in price. For example, one of these sweaters costs $335 and the other $9.99. Can you guess which is which?
Given the sheer number of products sold online, pricing is even more difficult. Clothing prices follow strong seasonal trends and are heavily influenced by the brand name, while the prices of electronic devices fluctuate with their specifications.
Mercari, Japan's largest community-driven shopping app, is deeply aware of this issue. Because sellers can list almost anything on the Mercari marketplace, it is hard to suggest a good price to them.
The Mercari Price Suggestion Challenge is a competition to estimate a "reasonable price" for a product from the data of items actually listed for sale. The product data includes the product name, description, condition, brand name, category name, and so on, and machine learning is used to predict the selling price from these.
The dataset, from the North American version of Mercari, is publicly available, so anyone can download it: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
In this article, I use this data to estimate appropriate prices.
train.tsv contains roughly 1.5 million items that were actually listed. Since the data comes from the North American version of Mercari, everything is in English. Each product is described by 8 columns.
Column | Description |
---|---|
train_id | ID of the listing |
name | Product name |
item_condition_id | Product condition |
category_name | Product category |
brand_name | Brand name |
price | Selling price (in dollars) |
shipping | Whether the shipping fee is paid by the seller or the buyer |
item_description | Product description |
This data is split into train and test sets, and the selling price is predicted with machine learning.
The code runs on Google Colaboratory. Because the dataset is very large, processing takes a long time without a GPU environment.
For an overview of Google Colaboratory and how to set it up, see: Google Colaboratory overview and usage procedure (TensorFlow and GPU can be used).
RMSLE is used when the target follows a distribution close to a **lognormal distribution** and when you want to measure the error between the actual and predicted values **as a ratio** rather than as an absolute difference.
Looking at the figure above, the histogram of product prices looks like a lognormal distribution. Also, for example, the pairs (1000, 5000) and (100000, 104000) both have an absolute error of 4000, but their error ratios differ greatly.
For these reasons, RMSLE seems to be a suitable evaluation metric for price estimation.
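For reference, a minimal sketch of RMSLE on raw prices (the full script at the end of this article applies `np.log1p` to the prices first and then takes a plain RMSE, which is equivalent):

```python
import numpy as np

def rmsle_raw(y_true, y_pred):
    """Root mean squared logarithmic error on raw (untransformed) prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same absolute error of 4000, very different RMSLE:
print(rmsle_raw(np.array([1000.0]), np.array([5000.0])))      # ~1.61
print(rmsle_raw(np.array([100000.0]), np.array([104000.0])))  # ~0.04
```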
Not only train.tsv but also test.tsv is published. However, since test.tsv has no ground-truth labels, about 10,000 rows are set aside from train.tsv and used as test data instead.
Overall data (1482535, 8) -> train_df (1472535, 8), test_df (10000, 7)
There are many missing values in the category, brand, and product description columns. Handling missing values is a standard preprocessing step, so they are filled in with the function below. After filling, the brand value "missing" accounted for 42% of all rows.
def handle_missing_inplace(dataset):
    # Fill missing categories, brands, and descriptions with placeholder values
    dataset['category_name'].fillna(value="Other", inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='None', inplace=True)
Before the type conversion, the brands are trimmed. There are about 5,000 distinct brands, and brand names that appear only a very few times are not very useful for learning, so they are replaced with the same "missing" value used for blanks.
pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "missing"
After cutting away roughly half of the brands, the least frequent remaining brand appears 4 times.
The text columns are converted to the categorical type, because dummy variables are created from them in a later step.
def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')
CountVectorizer is applied to the product names and category names. Simply put, CountVectorizer vectorizes text according to word occurrence counts. For example, applying it to the three product names 'MLB Cincinnati Reds T Shirt Size XL', 'AVA-VIV Blouse', and 'Leather Horse Statues' vectorizes them as follows.
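A minimal sketch of that vectorization (the column order follows the fitted vocabulary, and single-character tokens such as 'T' are dropped by the default tokenizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ['MLB Cincinnati Reds T Shirt Size XL',
         'AVA-VIV Blouse',
         'Leather Horse Statues']

cv = CountVectorizer()
X = cv.fit_transform(names)   # sparse count matrix, shape (3, number of words)
print(cv.vocabulary_)         # word -> column index
print(X.toarray())            # occurrence counts for each product name
```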
In addition, since product names are typed in by sellers, they can contain typos, as well as words or numbers that appear in only one specific listing. With this in mind, the min_df option is added to CountVectorizer. min_df excludes words that appear in fewer than min_df documents (an integer is a document count; a float is interpreted as a fraction of documents).
count_name = CountVectorizer(min_df=NAME_MIN_DF)
X_name = count_name.fit_transform(df["name"])
count_category = CountVectorizer()
X_category = count_category.fit_transform(df["category_name"])
Unlike CountVectorizer, TfidfVectorizer considers not only how often a word occurs but also how rare it is. Filler words that appear in almost every sentence, such as the Japanese "desu" and "masu" or English articles like "a" and "the", occur very frequently, and CountVectorizer is strongly pulled by them. TfidfVectorizer is used instead when you want to vectorize text with a focus on word importance.
In other words, TfidfVectorizer gives high importance to words that appear frequently in one document but rarely in other documents.
For these reasons, the product descriptions are vectorized with TfidfVectorizer.
Looking at the resulting table above, articles and conjunctions still have strong tf-idf values. Since such words carry no meaning for learning, stop_words='english' is specified to remove them.
Next, the figure on the left shows the 10 lowest tf-idf values. Terms with extremely small tf-idf values contribute little, so they are dropped. Also, instead of computing tf-idf for single words only, tf-idf is taken over sequences of consecutive words (n-grams). For example, let's look at the n-grams of the saying "an apple a day keeps the doctor away".
n-gram (1, 2)
{'an': 0, 'apple': 2, 'day': 5, 'keeps': 9, 'the': 11, 'doctor': 7, 'away': 4,
 'an apple': 1, 'apple day': 3, 'day keeps': 6, 'keeps the': 10,
 'the doctor': 12, 'doctor away': 8}
n-gram (1, 3)
{'an': 0, 'apple': 3, 'day': 7, 'keeps': 12, 'the': 15, 'doctor': 10, 'away': 6,
 'an apple': 1, 'apple day': 4, 'day keeps': 8, 'keeps the': 13, 'the doctor': 16,
 'doctor away': 11, 'an apple day': 2, 'apple day keeps': 5, 'day keeps the': 9,
 'keeps the doctor': 14, 'the doctor away': 17}
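These vocabularies can be reproduced with the `ngram_range` option, for example as below (the single-character word "a" is dropped by the default tokenizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["an apple a day keeps the doctor away"]

for ngram_range in [(1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram_range)
    vec.fit(text)
    # vocabulary_ maps each uni-/bi-/tri-gram to its column index
    print(ngram_range, vec.vocabulary_)
```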
In this way, widening the n-gram range captures the characteristics of the text in more detail and yields more useful features. With these options added, the result looks like the figure on the right.
The final tf-idf output looks like the figure below. The term with the highest tf-idf value is "description", which is clearly driven by the default text "No description yet". You can also see that condition words such as "new" and "used" affect the price.
tfidf_descp = TfidfVectorizer(max_features = MAX_FEAT_DESCP,
ngram_range = (1,3),
stop_words = "english")
X_descp = tfidf_descp.fit_transform(df["item_description"])
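To see which terms end up with the largest and smallest weights (as in the figures discussed above), one rough approach is to sum each term's tf-idf score over all descriptions; a sketch using the `tfidf_descp` and `X_descp` objects from the block above:

```python
import numpy as np
import pandas as pd

# Sum each term's tf-idf score over all item descriptions
scores = np.asarray(X_descp.sum(axis=0)).ravel()
terms = sorted(tfidf_descp.vocabulary_, key=tfidf_descp.vocabulary_.get)
tfidf_totals = pd.Series(scores, index=terms).sort_values(ascending=False)

print(tfidf_totals.head(10))  # highest-weighted terms ('description', 'new', ...)
print(tfidf_totals.tail(10))  # lowest-weighted terms
```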
As mentioned earlier, there are about 5,000 distinct brands, and after the trimming step about 2,500 remain.
These brand names are one-hot encoded into 0/1 vectors with LabelBinarizer. Since the data is large, sparse_output=True is set before running it.
label_brand = LabelBinarizer(sparse_output=True)
X_brand = label_brand.fit_transform(df["brand_name"])
Dummy variables are a technique for converting non-numeric data into numbers, specifically into columns containing only 0s and 1s. Here, dummy variables are created for the product condition and shipping columns.
X_dummies = scipy.sparse.csr_matrix(pd.get_dummies(df[[
"item_condition_id", "shipping"]], sparse = True).values, dtype=int)
Now that we have processed all the columns, we will combine all the arrays and put them in the model.
X = scipy.sparse.hstack((X_dummies,
X_descp,
X_brand,
X_category,
X_name)).tocsr()
Since not every parameter can be explained, the main ones are summarized briefly.
Option | Description |
---|---|
alpha | Strength of the regularization that prevents overfitting |
max_iter | Maximum number of solver iterations |
tol | Tolerance of the stopping criterion (training continues only while the score improves by at least tol) |
alpha
A model can be fit so closely to the given training data that its error on that data is tiny, yet it fails to predict unknown data well; this is called **overfitting**. Overfitting can be prevented by putting constraints on the learned parameters, and such a constraint is called **regularization**.
Option | Description |
---|---|
n_estimators | Number of decision trees |
learning_rate | Weight (shrinkage) applied to each tree's contribution |
max_depth | Maximum depth of each tree |
num_leaves | Number of leaves per tree |
min_child_samples | Minimum number of samples in a leaf node |
n_jobs | Number of parallel jobs |
learning_rate
- Generally increases accuracy, but makes overfitting easier.
- If it is too small, the computational load grows and training takes a long time.
n_estimators
- The most important parameter in a random forest.
Ridge
First, search for the optimal value of alpha. Alpha is varied over the range 0.05 to 75 to visualize its effect on accuracy.
From the figure, the minimum, RMSLE 0.4745938085035464, was obtained at alpha = 3.0.
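A rough sketch of such a sweep, assuming the `X_train`, `y_train`, `X_test`, `target`, and `rmsle` objects defined in the full script at the end of this article (the alpha grid here is only illustrative):

```python
from sklearn.linear_model import Ridge

# Try a handful of alpha values between 0.05 and 75 and compare dev-set RMSLE
for alpha in [0.05, 0.5, 1, 3, 10, 30, 75]:
    ridge = Ridge(solver='auto', fit_intercept=True, alpha=alpha)
    ridge.fit(X_train, y_train)
    preds = ridge.predict(X_test)
    print(f"alpha={alpha}: RMSLE={rmsle(target, preds):.4f}")
```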
Next, searching the maximum number of iterations max_iter over a wide range brought no improvement in accuracy, and the larger the tol value, the lower the accuracy became.
Based on the above, the Ridge model is built with alpha = 3.
LGBM
For tuning the LGBM parameters, I referred to the official documentation: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
A standard first step in tuning LGBM seems to be setting learning_rate and n_estimators. To improve accuracy, learning_rate should be small and n_estimators large. learning_rate is varied over the range 0.05 to 0.7 while adjusting n_estimators.
Next, with learning_rate and n_estimators fixed, num_leaves is varied.
- num_leaves = 20: RMSLE 0.4620242411418184
- num_leaves = 31: RMSLE 0.4569169142862856
- num_leaves = 40: RMSLE 0.45587232757584967
Increasing num_leaves improves accuracy overall; "overall" here means the trend holds even when the other parameters are adjusted.
However, while adjusting the parameters, raising num_leaves too far caused **overfitting**, and in some cases a good score could not be obtained, so it had to be balanced carefully against the other parameters.
For example, with learning_rate = 0.7, max_depth = 15, num_leaves = 30: RMSLE 44.650714399639845.
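A sketch of how such a num_leaves sweep could look, assuming the `train_X`/`train_y` split, `X_test`, `target`, and `rmsle` from the full script below; the other parameter values are simply the article's final settings:

```python
from lightgbm import LGBMRegressor

for num_leaves in [20, 31, 40]:
    params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
              'num_leaves': num_leaves, 'subsample': 0.9, 'colsample_bytree': 0.8,
              'min_child_samples': 50, 'n_jobs': 4}
    model = LGBMRegressor(**params)
    model.fit(train_X, train_y)
    print(f"num_leaves={num_leaves}: RMSLE={rmsle(target, model.predict(X_test)):.4f}")
```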
The final LGBM model looks like this:
lgbm_params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
'num_leaves': 40, 'subsample': 0.9, 'colsample_bytree': 0.8,
'min_child_samples': 50, 'n_jobs': 4}
Ridge + LGBM is used to compute the final predictions. LGBM scores better than Ridge on its own, but combining the two models improves accuracy further.
Ridge RMSL error on dev set: 0.47459370995217937
LGBM RMSL error on dev set: 0.45317097672035855
Ridge + LGBM RMSL error on dev set: 0.4433081424824549
With this accuracy, a $30 item would be predicted in a range of roughly $18.89 to $47.29 (exp(ln(31) ± 0.4433) − 1).
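A quick check of that range:

```python
import numpy as np

rmsle_value = 0.4433   # Ridge + LGBM dev-set RMSLE
price = 30.0
low = np.expm1(np.log1p(price) - rmsle_value)    # ~18.89
high = np.expm1(np.log1p(price) + rmsle_value)   # ~47.29
print(low, high)
```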
Here price is the Ridge + LGBM prediction and real_price is the actual price. For about 7,553 of the 10,000 test items, the error was less than $10.
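That count can be reproduced from the `price` and `real_price` columns added to `test_df` at the end of the full script, for example:

```python
# Number of test items whose predicted price is within $10 of the actual price
within_10 = (test_df['price'] - test_df['real_price']).abs() < 10
print(within_10.sum(), "of", len(test_df))
```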
(Figure) Residual plot on a log scale
(Figure) Distribution of actual and predicted prices
Although I simply took the difference, there are about 90 products where the predicted and actual prices differ by $100 or more. Since this dataset is from two years ago, relatively new products such as the Apple Watch have few data points and cannot be predicted well. Also, this being Mercari, sellers price items according to their personal sense of value, so not everything can be predicted accurately. That Coach bag actually sold for about $9...
import numpy as np
import pandas as pd
import scipy
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
NUM_BRANDS = 2500
NAME_MIN_DF = 10
MAX_FEAT_DESCP = 10000
print("Reading in Data")
df = pd.read_csv('train.tsv', sep='\t')
print('Formatting Data')
shape = df.shape[0]
train_df = df[:shape-10000]
test_df = df[shape-10000:]
target = test_df.loc[:, 'price'].values
target = np.log1p(target)
print("Concatenate data")
df = pd.concat([train_df, test_df], axis=0)
nrow_train = train_df.shape[0]
y_train = np.log1p(train_df["price"])
def handle_missing_inplace(dataset):
    dataset['category_name'].fillna(value="Other", inplace=True)
    dataset['brand_name'].fillna(value='missing', inplace=True)
    dataset['item_description'].fillna(value='None', inplace=True)
print('Handle missing')
handle_missing_inplace(df)
def to_categorical(dataset):
    dataset['category_name'] = dataset['category_name'].astype('category')
    dataset['brand_name'] = dataset['brand_name'].astype('category')
    dataset['item_condition_id'] = dataset['item_condition_id'].astype('category')
print('Convert categorical')
to_categorical(df)
print('Cut')
pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "missing"
print("Name Encoders")
count_name = CountVectorizer(min_df=NAME_MIN_DF)
X_name = count_name.fit_transform(df["name"])
print("Category Encoders")
count_category = CountVectorizer()
X_category = count_category.fit_transform(df["category_name"])
print("Descp encoders")
tfidf_descp = TfidfVectorizer(max_features = MAX_FEAT_DESCP,
ngram_range = (1,3),
stop_words = "english")
X_descp = tfidf_descp.fit_transform(df["item_description"])
print("Brand encoders")
label_brand = LabelBinarizer(sparse_output=True)
X_brand = label_brand.fit_transform(df["brand_name"])
print("Dummy Encoders")
X_dummies = scipy.sparse.csr_matrix(pd.get_dummies(df[[
"item_condition_id", "shipping"]], sparse = True).values, dtype=int)
X = scipy.sparse.hstack((X_dummies,
X_descp,
X_brand,
X_category,
X_name)).tocsr()
print("Finished to create sparse merge")
X_train = X[:nrow_train]
X_test = X[nrow_train:]
model = Ridge(solver='auto', fit_intercept=True, alpha=3)
print("Fitting Rige")
model.fit(X_train, y_train)
print("Predicting price Ridge")
preds1 = model.predict(X_test)
def rmsle(Y, Y_pred):
    assert Y.shape == Y_pred.shape
    return np.sqrt(np.mean(np.square(Y_pred - Y)))
print("Ridge RMSL error on dev set:", rmsle(target, preds1))
def rmsle_lgb(labels, preds):
    return 'rmsle', rmsle(preds, labels), False
train_X, valid_X, train_y, valid_y = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
lgbm_params = {'n_estimators': 1000, 'learning_rate': 0.4, 'max_depth': 15,
'num_leaves': 40, 'subsample': 0.9, 'colsample_bytree': 0.8,
'min_child_samples': 50, 'n_jobs': 4}
model = LGBMRegressor(**lgbm_params)
print('Fitting LGBM')
model.fit(train_X, train_y,
eval_set=[(valid_X, valid_y)],
eval_metric=rmsle_lgb,
early_stopping_rounds=100,
verbose=True)
print("Predict price LGBM")
preds2 = model.predict(X_test)
print("LGBM RMSL error on dev set:", rmsle(target, preds2))
preds = (preds1 + preds2) / 2
print("Ridge + LGBM RMSL error on dev set:", rmsle(target, preds))
test_df["price1"] = np.expm1(preds1)
test_df['price2'] = np.expm1(preds2)
test_df['price']= np.expm1(preds)
test_df['real_price'] = np.expm1(target)
Estimating fair prices in this way gave a better score than I expected. I think accuracy could have improved a bit more by tuning min_df and the n-gram range during preprocessing, and by cleaning the text more carefully rather than just feeding it to tf-idf. Also, each person values products differently, so prices can only ever be predicted to a certain extent. If you found this helpful, please give it a like!