[PYTHON] Learn from the winning code-Mercari Competition ①-

Introduction

While studying Kaggle, I decided to learn from the code of people who took first place in past competitions. This time the subject of study is the 1st place code of the Mercari competition (Mercari Price Suggestion Challenge).

What I learned

・ Time measurement using a context manager
・ Pipeline and FunctionTransformer
・ TF-IDF, itemgetter, TfidfVectorizer
・ Good accuracy can be obtained even with a 4-layer MLP (multilayer perceptron)
・ Using partial to fix y_train and vary only x_train

Mercari Competition Overview

Contents

Creating a model that predicts a reasonable price at the time of listing

Significance

By automatically suggesting an appropriate price from the product information at the time of listing, the time and effort needed to list an item is reduced, so listing becomes easier.

Background

If you list an item above the market price on Mercari, it will not sell. Conversely, if you list it below the market price, the customer loses out.

Competition constraints

Kernel competition: you submit the source code itself to Kaggle. Once submitted, it is run on Kaggle to calculate your score. There are restrictions on computing resources and run time:

・ CPU: 4 cores
・ Memory: 16GB
・ Disk: 1GB
・ Time limit: 1 hour
・ GPU: None

Evaluation

RMSLE: Root Mean Squared Logarithmic Error. The lower the score, the smaller the error in the estimated price.

The 1st place model scored RMSLE: 0.3875.
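
The metric named above can be written out in a few lines. This is a minimal sketch using NumPy; the function name `rmsle` and the sample prices are made up for illustration. The validation code later in this post computes the same quantity with `np.sqrt(mean_squared_log_error(...))` from scikit-learn.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a price of 0 does not produce log(0)
    diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative prices only, not taken from the competition data
prices = [10.0, 25.0, 100.0]
print(rmsle(prices, prices))               # perfect predictions give 0.0
print(rmsle(prices, [12.0, 20.0, 90.0]))   # any mismatch gives a positive value
```

Because the error is taken on the log scale, being off by 20% on a $10 item costs about as much as being off by 20% on a $100 item.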

Data used

| Column name | Description |
|---|---|
| name | Product name |
| item_condition_id | The condition of the product, such as used or new (1 to 5). The larger one is in better condition. |
| category_name | Rough category/Detailed category/More detailed category |
| brand_name | Brand name. Example: Nike, Apple |
| price | Past selling price (USD) |
| shipping | Whether the seller or the buyer pays the shipping cost. 1 -> the seller pays, 0 -> the buyer pays. |
| item_description | Product details |

Output format

test_id and price.

Key points of the 1st place code

・ As short as 100 lines. Simple.
・ A 4-layer MLP, and it is accurate. Were neural networks not yet widely used in competitions at that time?
・ TF-IDF. Does combining strings with df['name'].fillna('') + ' ' + df['brand_name'].fillna('') improve accuracy?
・ Standardization of y_train
・ Train 4 models on 4 cores -> ensemble

Preparing the training data

Measuring the time required for each process

Since there is a one-hour limit, the code measures how much time is spent in each step. Each step is wrapped in a with timer(...) block.
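
The definition of timer is not shown in this post. A minimal sketch of such a context manager, built with `contextlib.contextmanager` from the standard library, could look like the following; the exact message format is an assumption.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    # Code before `yield` runs on entering the with-block
    t0 = time.time()
    yield
    # Code after `yield` runs when the with-block finishes normally
    print(f'[{name}] done in {time.time() - t0:.1f} s')

with timer('sleep a little'):
    time.sleep(0.1)
```

Wrapping each step this way keeps the timing code out of the step itself, which makes it easy to see which step eats into the one-hour budget.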

Creating the training data

qiita.rb


# Fragment of the main routine: timer, preprocess, vectorizer and y_scaler
# are defined in other parts of the script
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

with timer('process train'):
    # Load the training data
    train = pd.read_table('../input/train.tsv')
    # Drop rows whose price is $0
    train = train[train['price'] > 0].reset_index(drop=True)
    # Prepare a 20-fold splitter for training and validation
    cv = KFold(n_splits=20, shuffle=True, random_state=42)
    # split() yields pairs of (training indices, validation indices);
    # next() takes only the first pair from the iterator
    train_ids, valid_ids = next(cv.split(train))
    # Split into training and validation sets with the obtained indices
    train, valid = train.iloc[train_ids], train.iloc[valid_ids]
    # Reshape price into n rows x 1 column, apply log(price + 1), then standardize
    y_train = y_scaler.fit_transform(np.log1p(train['price'].values.reshape(-1, 1)))
    # Vectorize the features with the pipeline
    X_train = vectorizer.fit_transform(preprocess(train)).astype(np.float32)
    print(f'X_train: {X_train.shape} of {X_train.dtype}')
    del train
# Apply the same preprocessing to the validation data
with timer('process valid'):
    X_valid = vectorizer.transform(preprocess(valid)).astype(np.float32)
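
The two less obvious steps above, taking a single fold with next() and standardizing the log price, can be tried on a toy array. This is a sketch under stated assumptions: the prices are made up, and n_splits=5 is used instead of 20 so that the split is visible on 10 rows.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Made-up prices, for illustration only
prices = np.array([3.0, 10.0, 12.0, 25.0, 40.0, 8.0, 15.0, 60.0, 5.0, 30.0])

# With n_splits=5 each fold holds out 1/5 of the rows (n_splits=20 holds out 1/20)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
# split() is a generator of (train_idx, valid_idx); next() consumes only the first fold
train_ids, valid_ids = next(cv.split(prices))
print(len(train_ids), len(valid_ids))  # 8 2

# StandardScaler expects a 2-D column, hence reshape(-1, 1)
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(np.log1p(prices[train_ids].reshape(-1, 1)))
print(y_train.shape)  # (8, 1)
```

Using only the first fold gives a single hold-out set rather than full cross-validation, which saves time when the run time is limited.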

Preprocessing

Since the brand name has missing values, they are replaced with an empty string, and then the product name and brand name are concatenated. To make the later TF-IDF step easier, a new column called text is created by joining the item description, the name, and the category name. Only 'name', 'text', 'shipping', and 'item_condition_id' are used in the subsequent Pipeline processing.

qiita.rb


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df['name'] = df['name'].fillna('') + ' ' + df['brand_name'].fillna('')
    df['text'] = (df['item_description'].fillna('') + ' ' + df['name'] + ' ' + df['category_name'].fillna(''))
    return df[['name', 'text', 'shipping', 'item_condition_id']]

A Pipeline is used so that extracting a field and computing TF-IDF can be performed as one series of steps.

qiita.rb


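from operator import itemgetter
from typing import Dict, List

import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer as Tfidf
from sklearn.pipeline import Pipeline, make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def on_field(f: str, *vec) -> Pipeline:
    # Pick column(s) f from the DataFrame, then apply the given transformers in order
    return make_pipeline(FunctionTransformer(itemgetter(f), validate=False), *vec)

def to_records(df: pd.DataFrame) -> List[Dict]:
    # Convert each row to a dict so that DictVectorizer can consume it
    return df.to_dict(orient='records')

vectorizer = make_union(
    on_field('name', Tfidf(max_features=100000, token_pattern=r'\w+')),
    on_field('text', Tfidf(max_features=100000, token_pattern=r'\w+', ngram_range=(1, 2))),
    on_field(['shipping', 'item_condition_id'],
             FunctionTransformer(to_records, validate=False), DictVectorizer()),
    n_jobs=4)
y_scaler = StandardScaler()

X_train = vectorizer.fit_transform(preprocess(train)).astype(np.float32)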

The output has 200002 features in total: 200000 TF-IDF features for the text fields (100000 each for 'name' and 'text') plus the features for 'shipping' and 'item_condition_id'.
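
To see how the feature count adds up, the same union can be run on a tiny DataFrame. This is a sketch, not the original run: the three rows are made up, `max_features` and `n_jobs` are left out, and the helper definitions are repeated so the block runs on its own.

```python
from operator import itemgetter

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

def on_field(f, *vec):
    # itemgetter(f)(df) is the same as df[f]
    return make_pipeline(FunctionTransformer(itemgetter(f), validate=False), *vec)

def to_records(df):
    return df.to_dict(orient='records')

# Made-up rows that already have the columns returned by preprocess()
toy = pd.DataFrame({
    'name': ['nike shoes', 'apple phone', 'nike shirt'],
    'text': ['red running shoes', 'used phone', 'cotton shirt'],
    'shipping': [1, 0, 1],
    'item_condition_id': [1, 3, 2],
})

vectorizer = make_union(
    on_field('name', TfidfVectorizer(token_pattern=r'\w+')),
    on_field('text', TfidfVectorizer(token_pattern=r'\w+')),
    on_field(['shipping', 'item_condition_id'],
             FunctionTransformer(to_records, validate=False), DictVectorizer()),
)
X = vectorizer.fit_transform(toy)
# 5 distinct words in name + 7 in text + 2 numeric fields
print(X.shape)  # (3, 14)
```

Each branch of the union produces its own sparse block, and make_union stacks the blocks side by side, so the final width is simply the sum of the branch widths.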

Learning

Four models are trained in parallel on 4 cores with 4 threads, and their predictions are then averaged as an ensemble. During training, partial fixes y_train so that only xs changes between the models.
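
The combination of partial and a thread pool can be tried without any model. In this sketch `fit_predict_stub` is a hypothetical stand-in that only adds the mean of y_train to its inputs; it is there purely to show how the arguments flow.

```python
from functools import partial
from multiprocessing.pool import ThreadPool

def fit_predict_stub(xs, y_train):
    # Hypothetical stand-in for a real fit_predict: no model is trained here
    x_train, x_valid = xs
    offset = sum(y_train) / len(y_train)
    return [v + offset for v in x_valid]

y_train = [1.0, 2.0, 3.0]
xs = [([0, 0, 0], [10.0, 20.0]), ([0, 0, 0], [30.0, 40.0])]

# partial() binds y_train once, leaving a one-argument function that pool.map can call
task = partial(fit_predict_stub, y_train=y_train)
with ThreadPool(processes=2) as pool:
    results = pool.map(task, xs)
print(results)  # [[12.0, 22.0], [32.0, 42.0]]
```

pool.map passes exactly one item to the function per call, which is why the fixed argument has to be bound beforehand.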

qiita.rb


# Written for TensorFlow 1.x with standalone Keras
# (tf.ConfigProto and tf.Session are TensorFlow 1.x APIs)
from functools import partial
from multiprocessing.pool import ThreadPool

import keras as ks
import numpy as np
import tensorflow as tf
from sklearn.metrics import mean_squared_log_error

def fit_predict(xs, y_train) -> np.ndarray:
    X_train, X_test = xs
    # Limit each session to a single thread so that 4 sessions can share 4 cores
    config = tf.ConfigProto(
        intra_op_parallelism_threads=1, use_per_session_threads=1, inter_op_parallelism_threads=1)
    with tf.Session(graph=tf.Graph(), config=config) as sess, timer('fit_predict'):
        ks.backend.set_session(sess)
        # MLP definition: sparse input followed by 4 Dense layers (192, 64, 64, 1)
        model_in = ks.Input(shape=(X_train.shape[1],), dtype='float32', sparse=True)
        out = ks.layers.Dense(192, activation='relu')(model_in)
        out = ks.layers.Dense(64, activation='relu')(out)
        out = ks.layers.Dense(64, activation='relu')(out)
        out = ks.layers.Dense(1)(out)
        model = ks.Model(model_in, out)
        model.compile(loss='mean_squared_error', optimizer=ks.optimizers.Adam(lr=3e-3))
        for i in range(3):  # 3 epochs
            with timer(f'epoch {i + 1}'):
                # The batch size doubles every epoch: 2**11, 2**12, 2**13
                model.fit(x=X_train, y=y_train, batch_size=2**(11 + i), epochs=1, verbose=0)
        return model.predict(X_test)[:, 0]  # Return the predictions


with ThreadPool(processes=4) as pool:  # 4 threads
    # Binarized copies of the features (nonzero -> 1.0); bool replaces the removed np.bool alias
    Xb_train, Xb_valid = [x.astype(bool).astype(np.float32) for x in [X_train, X_valid]]
    xs = [[Xb_train, Xb_valid], [X_train, X_valid]] * 2
    # Average the predictions of the 4 models
    y_pred = np.mean(pool.map(partial(fit_predict, y_train=y_train), xs), axis=0)
# Undo the standardization and the log transform to get back to prices
y_pred = np.expm1(y_scaler.inverse_transform(y_pred.reshape(-1, 1))[:, 0])
print('Valid RMSLE: {:.4f}'.format(np.sqrt(mean_squared_log_error(valid['price'], y_pred))))
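
Two steps in the last part are easy to miss: the binarized copy of the features, and the way averaged predictions are mapped back to prices. The sketch below reproduces both on made-up numbers. The four "model outputs" are simulated by adding small offsets to the true targets, which is purely an assumption for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# A tiny sparse TF-IDF-like matrix with made-up weights
X = csr_matrix(np.array([[0.0, 0.7, 0.0], [0.2, 0.0, 0.5]], dtype=np.float32))
# astype(bool) keeps only "is the term present?"; the np.bool alias was
# deprecated in NumPy 1.20 and removed in 1.24, so plain bool is used here
Xb = X.astype(bool).astype(np.float32)
print(Xb.toarray())

prices = np.array([5.0, 10.0, 20.0, 40.0])
y_scaler = StandardScaler()
y = y_scaler.fit_transform(np.log1p(prices.reshape(-1, 1)))

# Simulated outputs of four models in the scaled log space (offsets cancel out)
preds = [y[:, 0] + d for d in (-0.02, -0.01, 0.01, 0.02)]
y_mean = np.mean(preds, axis=0)
# inverse_transform undoes the standardization, expm1 undoes log1p
restored = np.expm1(y_scaler.inverse_transform(y_mean.reshape(-1, 1))[:, 0])
print(restored)
```

Note that the averaging happens in the scaled log space and only the averaged result is converted back, which matches the order of operations in the code above.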

References

・ [Reference ①](https://copypaste-ds.hatenablog.com/entry/2019/02/15/170121#1-%E3%82%B7%E3%83%B3%E3%83%97%E3%83%AB%E3%81%AAMLP)
・ Reference ②
・ Reference ③
・ Reference ④
・ BRONZE acquirer's method
・ Mercari HP
