[PYTHON] [Survey] Kaggle - Quora 2nd place solution summary

A survey of the 2nd place solution [^2] to the Kaggle Quora Question Pairs competition [^1].

Author: Silogram
Title: Overview of 2nd-Place Solution
Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34310

Summary

- An ensemble of 6 LightGBM [^3] models and 1 neural network
- Calibration using graph-structural properties (similar to the method of Jared, 3rd place [^4])
- Thousands of feature dimensions (including sparse N-gram vectors)
- A single model scored 0.116 to 0.117 on the leaderboard (LB)
- What helped on the NLP side was processing the text in many different ways (e.g. lowercased and unchanged, punctuation converted in different ways, with and without stopword removal, with and without stemming)
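The last point, processing the same text in several different ways, can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' pipeline: the `preprocess` and `all_variants` helpers and the tiny `STOPWORDS` set are hypothetical, stemming is left out to keep the sketch free of third-party libraries, and punctuation handling is reduced to a single on/off switch.

```python
import itertools
import string

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "to", "of"}

# Translation table that replaces every ASCII punctuation character with a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, lowercase, strip_punct, drop_stopwords):
    """Tokenize `text` with each processing step switched on or off."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def all_variants(text):
    """Return every on/off combination of the three steps (2**3 = 8 variants)."""
    return {
        flags: preprocess(text, *flags)
        for flags in itertools.product([False, True], repeat=3)
    }


variants = all_variants("How do I learn Python?")
for flags, tokens in variants.items():
    print(flags, tokens)
```

Features (word overlap, TF-IDF similarity, and so on) could then be computed once per variant, so the model sees several views of the same question pair.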

Problems in the contest

- Because of a flaw in how the question pairs were selected, the graph structure formed by the question pairs became important.
- Many questions were related to India, which affected the TF-IDF and TF features (would it be better without this regional influence?)
- Inadequate labeling was noticeable

About sparse N-gram

- Use binary TF. Remove the top 2000 1-grams and 2-grams.
- The binary vectors of question 1 and question 2 are added together, giving one of 3 labels per N-gram (0: in neither question, 1: in only one question, 2: in both questions)
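The summing step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the `ngrams`, `binary_ngram_vector`, and `pair_labels` helpers and the five-term `vocabulary` are hypothetical, tokenization is a simple lowercase whitespace split, and the top-2000 filtering of 1-grams and 2-grams is not modeled here.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams (joined by spaces) in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def binary_ngram_vector(text, vocabulary):
    """Binary TF: 1 if the vocabulary term occurs in the text, else 0."""
    tokens = text.lower().split()
    present = ngrams(tokens, 1) | ngrams(tokens, 2)
    return [1 if term in present else 0 for term in vocabulary]


def pair_labels(question1, question2, vocabulary):
    """Add the two binary vectors: 0 = in neither, 1 = in one, 2 = in both."""
    v1 = binary_ngram_vector(question1, vocabulary)
    v2 = binary_ngram_vector(question2, vocabulary)
    return [a + b for a, b in zip(v1, v2)]


# Hypothetical vocabulary of 1-grams and 2-grams (illustration only).
vocabulary = ["learn", "python", "learn python", "java", "ruby"]
labels = pair_labels("how to learn python", "how to learn java", vocabulary)
print(labels)  # [2, 1, 1, 1, 0]
```

Because each entry takes only the values 0, 1, or 2, the result stays sparse and can be stored in a sparse matrix before being passed to a tree-based model.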

References