[Survey] Kaggle - Quora 4th place solution summary

A survey of the 4th-place solution [^2] to the Kaggle Quora Question Pairs competition [^1].

Title: [4th] Overview of 4th-Place Solution
Author: HouJP
Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34349
Code: https://github.com/HouJP/kaggle-quora-question-pairs

Flow

  1. Preprocessing
  2. Feature extraction
  3. Model building
  4. Post-processing

(Figure: solution pipeline diagram, quoted from HouJP/kaggle-quora-question-pairs [^4])

Preprocessing

- Text cleaning: correcting typographical errors, handling symbols, expanding abbreviations, etc.
- Word stemming: Snowball Stemmer [^3], etc.
- Shared-word removal: removing words that appear in both questions of a pair
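The preprocessing steps above can be sketched in pure Python as follows. This is an illustrative sketch, not the authors' actual code: the abbreviation table and function names are hypothetical, and a real pipeline would also apply a stemmer such as NLTK's SnowballStemmer.

```python
import re

# Hypothetical abbreviation table; the real pipeline would use a much larger one.
ABBREVIATIONS = {"what's": "what is", "can't": "cannot", "won't": "will not"}

def clean_text(text: str) -> str:
    """Lowercase, expand a few abbreviations, and replace symbols with spaces."""
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"[^a-z0-9\s]", " ", text)

def remove_shared_words(q1: str, q2: str) -> tuple[list[str], list[str]]:
    """Drop words that appear in both questions, keeping the rest."""
    w1, w2 = clean_text(q1).split(), clean_text(q2).split()
    shared = set(w1) & set(w2)
    return [w for w in w1 if w not in shared], [w for w in w2 if w not in shared]
```

For example, `remove_shared_words("What's the best laptop?", "What's the cheapest laptop?")` keeps only the words that distinguish the pair.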

Feature extraction

- More than 1,400 features
- Statistics: ratio of common words, sentence length, number of words, etc.
- Natural language processing: syntax-tree parsing, number of negation words, etc.
- Graph structure: PageRank, HITS, shortest path, clique size, etc.
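A few of the statistical features listed above can be sketched as below. The function names and the exact feature definitions are my own assumptions for illustration; the graph features (PageRank, HITS, cliques) would additionally require building a question co-occurrence graph, e.g. with NetworkX, and are omitted here.

```python
def common_word_ratio(q1_words: list[str], q2_words: list[str]) -> float:
    """Ratio of shared distinct words to all distinct words (Jaccard-style)."""
    s1, s2 = set(q1_words), set(q2_words)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def basic_stats(q1_words: list[str], q2_words: list[str]) -> dict[str, float]:
    """A small subset of the 'statistics' feature family for one question pair."""
    return {
        "common_word_ratio": common_word_ratio(q1_words, q2_words),
        "len_diff": abs(len(q1_words) - len(q2_words)),
        "word_count_q1": len(q1_words),
        "word_count_q2": len(q2_words),
    }
```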

Model building

- Neural networks, XGBoost, LightGBM, and Logistic Regression (the best single model scored LB 0.122 to 0.124)
- Stacking of 140 models (0.007 improvement on LB)
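The simplest way to see why combining models helps is a weighted blend of their predicted probabilities. This toy sketch only approximates the idea: the authors actually stacked ~140 models with a trained meta-model (typically fed out-of-fold base predictions), not a fixed-weight average.

```python
def blend(predictions: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of several base models' probability predictions.

    predictions: one list of per-sample probabilities per base model.
    weights: one non-negative weight per base model.
    """
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]
```

In a real stacking setup the weights would be replaced by a second-level learner (e.g. logistic regression) trained on the base models' out-of-fold outputs.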

Post-processing

- Because the label distribution differed between the training data and the test data, the prediction weights had to be adjusted.
- Splitting the data by clique size and adjusting the weights for each split improved LB by 0.001.
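A standard way to correct for a different positive-class rate between train and test in this competition was to rescale each predicted probability by the ratio of class priors. The sketch below illustrates that trick; the default rates (roughly 0.37 positives in train, 0.165 estimated in test) are commonly cited community estimates, not figures from this post, and the authors applied their adjustment per clique-size split rather than globally.

```python
def rebalance(p: float, train_pos: float = 0.37, test_pos: float = 0.165) -> float:
    """Rescale a predicted duplicate probability p to match the test prior.

    train_pos / test_pos are assumed positive-class rates (illustrative values).
    """
    a = test_pos / train_pos              # scale factor for the positive class
    b = (1 - test_pos) / (1 - train_pos)  # scale factor for the negative class
    return a * p / (a * p + b * (1 - p))
```

Since the test set has fewer positives than the training set, the adjustment pulls every probability downward (except the endpoints 0 and 1, which are fixed points).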

References
