[PYTHON] [Survey] Kaggle --Quora 4th place solution summary

Kaggle --Quora Question Pairs [^ 1] 4th place solution [^ 2] research article.

Title: [4th] Overview of 4th-Place Solution Author: HouJP Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34349 Code: https://github.com/HouJP/kaggle-quora-question-pairs

flow

Preprocessing
Feature extraction
Model building
Post-processing

Quoted from HouJP / kaggle-quora-question-pairs [^ 4]

Preprocessing

--Text-cleaning: Correcting typographical errors, handling symbols, restoring acronyms, etc. --Word-stemming: Snowball Stemmer [^ 3], etc. --Shared-word-removing: Removal of words that appear in both

Feature extraction

--More than 1400 features --Statistics: ratio of common words, sentence length, number of words, etc. --Natural language processing: parsing syntax trees, number of negative words, etc. --Graph structure: PageRank, hits, shortest path, creek size, etc.

Model building

--Neural network, XGBoost, LightGBM, Logistic Regression (LB = 0.122 to 0.124 is the best for a single model) --140 model Model Stacking (0.007 improvement on LB)

Post-processing

--Since the tendency of the data was different between the training data and the test data, it was necessary to adjust the weights. --Dividing the data according to the size of the creek and adjusting the weight (this operation improves 0.001 on LB)

References