[PYTHON] Kaggle Summary: Outbrain # 2

2nd place solution

Here I introduce the runner-up's approach (2nd place solution | team brain-afk): feature generation, multiple models, stacking, and a customized FFM.

Most important features

There are places where my understanding of the English original is uncertain, so if you are interested, please check the original post. Some open questions appear in the discussion section at the bottom.

Model used in Layer 1

Except for LibFFM and FTRL, the lineup looks much like an ordinary classification competition. Liblinear is quite a rare sight. Keras here is presumably just a multi-layer perceptron; no particular details are given.

Alexey customized FFM, which made it faster and lowered its memory consumption. The code is to be published on GitHub shortly.

CV & Meta Modeling

The terms "blend" and "stacking" mentioned here refer to ensemble learning.

We used 6M rows as our own validation set, sampled to match the structure of the test set (2 future days; 50% of the rows from shared days / 50% from future days). In addition, a training subset of about 14M rows was used to try out new ideas faster.

Before Alexey joined the team, we had built 20 layer-1 models on the 6M set, which gave a 0.003 improvement on the public leaderboard. We then trained a blend of XGBoost & Keras models on the 6M set (=> with no shared-days / future-days separation; this part of the original is unclear to me).

In the last week before the deadline, one model improved a lot, and the stacking result improved with it. The layer-1 predictions, generalized over time, were used as layer-2 features; this improved the score by 0.00020. Alexey then joined, and the final submission was finished by merging and blending with the stacked data he had.
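As a rough sketch of such a layer-1 / layer-2 setup, out-of-fold layer-1 predictions become the layer-2 features (the models, fold count, and random data below are illustrative assumptions, not the team's actual pipeline):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def make_layer2_features(X, y, layer1_models, cv=5):
    # out-of-fold predictions so layer 2 never sees leaked labels
    cols = [cross_val_predict(m, X, y, cv=cv, method='predict_proba')[:, 1]
            for m in layer1_models]
    return np.column_stack(cols)

# Hypothetical usage: two layer-1 models stacked into a logistic layer 2
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)
layer1 = [GradientBoostingClassifier(), LogisticRegression(max_iter=1000)]
layer2 = LogisticRegression().fit(make_layer2_features(X, y, layer1), y)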

Final solution

We submitted the geometric mean of bagged results from Alexey's meta stack, XGBoost, and Keras. The weights were chosen intuitively, with reference to the LB score.
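A weighted geometric-mean blend can be sketched as follows (the weights and prediction arrays below are placeholders, not the team's actual values):

import numpy as np

def geometric_blend(preds, weights):
    # weighted geometric mean of probability arrays, computed in log space
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    log_p = sum(wi * np.log(np.clip(p, 1e-15, 1.0))
                for wi, p in zip(w, preds))
    return np.exp(log_p)

# Placeholder predictions for the meta stack, XGBoost, and Keras models
p_stack, p_xgb, p_keras = (np.random.uniform(0.01, 0.99, 100) for _ in range(3))
blend = geometric_blend([p_stack, p_xgb, p_keras], weights=[0.5, 0.3, 0.2])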

Best single model

Alexey's customized FFM implementation gave the best accuracy: 0.70017 on the public leaderboard.

rcarson's features

I will introduce the feature-generation code from [Extract leak in 30 mins with small memory](https://www.kaggle.com/jiweiliu/outbrain-click-prediction/extract-leak-in-30-mins-with-small-memory/), published by rcarson. (Click here for the gist.)

Two data files are used: page_views.csv and promoted_content.csv. In short, page_views records the ids of the web pages each user has visited, and promoted_content records the details of each ad id.

import csv

leak = {}  # document_id -> 1 (flag) or, later, a set of uuids
for c, row in enumerate(csv.DictReader(open('../input/promoted_content.csv'))):
    if row['document_id'] != '':
        leak[row['document_id']] = 1

Every document_id that appears in promoted_content is flagged in leak.

count = 0
limit = float('inf')  # cap on rows to process; lower it for a quick test

filename = '../input/page_views.csv'
filename = '../input/page_views_sample.csv'  # comment this out locally
for c, row in enumerate(csv.DictReader(open(filename))):
    if count > limit:
        break
    if c % 1000000 == 0:
        print(c, count)
    if row['document_id'] not in leak:
        continue
    # first visit seen for this promoted document: replace the flag with a set
    if leak[row['document_id']] == 1:
        leak[row['document_id']] = set()
    lu = len(leak[row['document_id']])
    leak[row['document_id']].add(row['uuid'])
    if lu != len(leak[row['document_id']]):
        count += 1  # count only newly added (document_id, uuid) pairs

After that, whenever a page contained in promoted_content appears in page_views, the visiting user's id (uuid) is added to that document's set.

fo = open('leak.csv', 'w')  # output filename is an assumption
for i in leak:
    if leak[i] != 1:
        tmp = list(leak[i])
        fo.write('%s,%s\n' % (i, ' '.join(tmp)))
        del tmp
fo.close()

Finally, every document in leak that collected at least one uuid is written out to a file.

In the exported file, each line is a document_id followed by a comma and a space-separated list of uuids.

This completes the link file between document_id and uuid: in other words, a file describing which users visited each unique web page.
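One plausible way to consume this file is to load it back into a dict and flag whether the user behind a given impression has already visited an ad's landing document (a minimal sketch; the filename, the absence of a header, and the feature usage are my assumptions):

leak = {}
with open('leak.csv') as f:
    for line in f:
        doc_id, uuids = line.rstrip('\n').split(',')
        leak[doc_id] = set(uuids.split())

def is_leak(doc_id, uuid):
    # 1 if this user is known to have visited the ad's landing page
    return int(uuid in leak.get(doc_id, ()))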

About libffm

FFM (field-aware factorization machines) is an advanced version of FM (factorization machines), from the collaborative-filtering family. It has been used frequently since the middle of last year in competitions dealing with that type of data.
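For intuition: in FFM each feature keeps a separate latent vector for every field, and the score sums pairwise interactions, dotting the vector each feature holds for the other feature's field. Below is a minimal numpy sketch of that scoring function for binary (one-hot) features; it is an illustration, not the LibFFM implementation:

import numpy as np

def ffm_score(active, V, fields):
    # active: indices of active (value = 1) features
    # V: (n_features, n_fields, k) latent vectors
    # fields[j]: the field that feature j belongs to
    s = 0.0
    for a, j1 in enumerate(active):
        for j2 in active[a + 1:]:
            s += V[j1, fields[j2]] @ V[j2, fields[j1]]
    return s

# Hypothetical setup: 10 features in 3 fields, latent dimension k = 4
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(10, 3, 4))
fields = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
print(ffm_score([1, 4, 7], V, fields))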

As of March 9, 2017, attempts to install libffm on Linux and Mac run into bugs: the latest SDK does not have nanosocket, and import ffm fails if an earlier version is installed. The issue has been raised several times in turi-code issues and on Stack Overflow, but it does not seem to be resolved. For that reason, I do not use the libffm sample program that competition participants frequently rely on here. That is a shame, because it is the library that produced such good results this time.

Once the problem is resolved, I'll cover the details in another article.

FTRL-Proximal

SRK has released Python code using FTRL. Click here for the gist.

FTRL is the algorithm Google uses for CTR prediction; this is the original paper.

It is said to use vowpal wabbit together with this code. I haven't actually used vowpal wabbit, but top Kagglers use it quite a bit, so I'm thinking of implementing it together with FTRL and covering it in another article. Its predictions reportedly have low correlation with those of classifiers such as TensorFlow models, so it is useful when combining models in an ensemble.
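For reference, here is a minimal sketch of the per-coordinate FTRL-Proximal update from the paper, for binary (hashed) features; the hyperparameters and setup are illustrative, not SRK's exact code:

import numpy as np

class FTRLProximal:
    def __init__(self, n_features, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(n_features)  # accumulated adjusted gradients
        self.n = np.zeros(n_features)  # accumulated squared gradients

    def weights(self, idx):
        # lazy weights with L1 sparsity: w = 0 wherever |z| <= l1
        z, n = self.z[idx], self.n[idx]
        return np.where(
            np.abs(z) <= self.l1, 0.0,
            -(z - np.sign(z) * self.l1)
            / ((self.beta + np.sqrt(n)) / self.alpha + self.l2))

    def update(self, idx, y):
        # idx: indices of active binary features; y: label in {0, 1}
        w = self.weights(idx)
        p = 1.0 / (1.0 + np.exp(-np.clip(w.sum(), -35, 35)))
        g = p - y  # log-loss gradient for each active feature
        s = (np.sqrt(self.n[idx] + g * g) - np.sqrt(self.n[idx])) / self.alpha
        self.z[idx] += g - s * w
        self.n[idx] += g * g
        return p

# Hypothetical usage on hashed categorical features
model = FTRLProximal(n_features=2 ** 20)
p = model.update(idx=[123, 4567, 89012], y=1)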

liblinear

A tool for linear prediction on large datasets. It is rare to see it used in competitions, but it can lead to unexpectedly high scores. Reference site (English).
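If you just want to try it from Python, scikit-learn exposes LIBLINEAR as a solver for logistic regression (this uses the standard scikit-learn API; whether the team called LIBLINEAR directly or through a wrapper is not stated):

from sklearn.linear_model import LogisticRegression

# LIBLINEAR is the backend when solver='liblinear'; it copes well with
# large sparse feature matrices
clf = LogisticRegression(solver='liblinear', C=1.0)
# clf.fit(X_train, y_train); clf.predict_proba(X_test)[:, 1]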
