[PYTHON] [Survey] Kaggle - Quora 5th place solution summary

This article surveys the 5th place solution [^2] to the Kaggle Quora Question Pairs competition [^1].

- Title: [5th] 5th Place Solution Summary
- Authors: Faron, KazAnova
- Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34349
- Code (attached to forum): https://kaggle2.blob.core.windows.net/forum-message-attachments/190488/6625/mark_dodgie_qs_in_test.py

Summary

- XGBoost is trained on more than 600 features.
- More than 25 of those features are predictions from other models (LightGBM, NN, LSTM, SGD).
- Oversampling is applied so that the share of positive examples is about 0.13.
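The resampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `oversample_to_rate` is hypothetical, and it assumes binary 0/1 labels and deterministic duplication, cycling through the rows of whichever class needs more rows to reach the target positive rate.

```python
import math
from itertools import cycle, islice


def oversample_to_rate(labels, target_rate):
    """Return row indices whose share of positives is close to target_rate.

    Hypothetical helper: rows of one class are repeated (never dropped)
    until the positive rate reaches the target.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    current_rate = len(pos) / (len(pos) + len(neg))
    if current_rate > target_rate:
        # Too many positives: repeat negatives until the rate drops.
        n_neg = math.ceil(len(pos) * (1 - target_rate) / target_rate)
        neg = list(islice(cycle(neg), n_neg))
    else:
        # Too few positives: repeat positives until the rate rises.
        n_pos = math.ceil(len(neg) * target_rate / (1 - target_rate))
        pos = list(islice(cycle(pos), n_pos))
    return pos + neg
```

The returned indices can be used to select rows of the feature matrix before training. In practice the duplicated rows would usually be shuffled, and any validation fold would be resampled separately from the training fold.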

NLP

Features were extracted in several ways:

- Extracted from preprocessed versions of the text
  - Original question
  - Stemming
  - Text cleaning
  - Stop words only
  - Stop words removed
- Extracted by aggregating tokens
  - Common / non-common tokens
  - Number of tokens
  - Longest substring common to both questions
  - Incorrect grammar and punctuation
- Learned GloVe embeddings
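A few of the token-level features listed above can be sketched with the standard library alone. The function name `pair_features`, the whitespace tokenizer, and the feature names are assumptions made for illustration; they are not taken from the original solution.

```python
from difflib import SequenceMatcher


def pair_features(q1, q2):
    """Token-overlap and longest-common-substring features for one question pair."""
    tokens1, tokens2 = q1.lower().split(), q2.lower().split()
    set1, set2 = set(tokens1), set(tokens2)

    # Longest contiguous run of characters shared by both questions.
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))

    return {
        "n_tokens_q1": len(tokens1),
        "n_tokens_q2": len(tokens2),
        "n_common_tokens": len(set1 & set2),      # tokens in both questions
        "n_uncommon_tokens": len(set1 ^ set2),    # tokens in exactly one question
        "longest_common_substring": match.size,   # length in characters
    }
```

The same function could be applied to each preprocessed variant of a question pair (stemmed, stop words removed, and so on), giving one group of features per variant.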

Graph structure

Features from a graph in which each question is a node and each (question 1, question 2) pair is an edge proved valuable.

- Number of neighboring questions common to both questions
- Number of unique neighboring questions
- Number of paths of length n between questions 1 and 2
- Maximum clique size
- Number of connected components
- Transitivity: if y(q1, q3) = y(q2, q3) = a, then y(q1, q2) = a
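Several of these graph features can be sketched in plain Python. All function names here are hypothetical, and the sketch makes a few assumptions: the graph is undirected, "paths of length n" are counted as walks (nodes may repeat), and known labels are stored in a dict keyed by question pair. Maximum clique size is not covered.

```python
from collections import defaultdict, deque


def build_graph(pairs):
    """Undirected graph as adjacency sets: one node per question, one edge per pair."""
    adj = defaultdict(set)
    for q1, q2 in pairs:
        adj[q1].add(q2)
        adj[q2].add(q1)
    return dict(adj)


def neighbor_features(adj, q1, q2):
    """Counts of shared and distinct neighbors, ignoring the edge q1-q2 itself."""
    n1 = adj.get(q1, set()) - {q2}
    n2 = adj.get(q2, set()) - {q1}
    return {"common_neighbors": len(n1 & n2), "unique_neighbors": len(n1 | n2)}


def count_walks(adj, src, dst, n):
    """Number of length-n walks from src to dst (an assumption about 'paths')."""
    counts = {src: 1}
    for _ in range(n):
        nxt = defaultdict(int)
        for node, c in counts.items():
            for nb in adj.get(node, ()):
                nxt[nb] += c
        counts = nxt
    return counts.get(dst, 0)


def connected_components(adj):
    """List of node sets, one per connected component (breadth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            for nb in adj[queue.popleft()]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        components.append(comp)
    return components


def infer_by_transitivity(labels, q1, q2):
    """Return a if some q3 has y(q1, q3) = y(q2, q3) = a, otherwise None."""
    def y(a, b):
        return labels.get((a, b), labels.get((b, a)))

    thirds = {q for pair in labels for q in pair} - {q1, q2}
    for q3 in sorted(thirds):
        a, b = y(q1, q3), y(q2, q3)
        if a is not None and a == b:
            return a
    return None
```

Building the graph from the training and test pairs together, rather than from the training pairs alone, is a design choice this sketch leaves to the caller.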

Other features

Because the negative examples were constructed artificially, the following features also improved the score.

References
