This article is for day 19 of the Kaggle Advent Calendar 2019.
Hi, I'm kiccho1101! This is the first article I've ever written. Thanks for reading!
In this post, I'd like to introduce a way of managing features for Kaggle competitions with PostgreSQL, which worked better than I expected when I tried it.
The repository I created is here: https://github.com/kiccho1101/datascience-template. The README contains usage examples based on data from the Titanic competition.
In Kaggle competitions, if you write code without planning ahead (as I used to), you run into problems like the following:
- You forget what each feature represents
- Notebooks fall into chaos (exp1.ipynb, exp1_tmp.ipynb, exp1_tmp_tmp.ipynb, etc. pile up)
- Looking back half a year later, you find code that makes you wonder, "Did I really write this???"
To solve these problems, you need some way of managing your features.
For feature management, [Takanobu Nozawa's slide](https://speakerdeck.com/takapy/detafen-xi-konpenioite-te-zheng-liang-guan-li-nipi-bi-siteiruquan-ren-lei-nichuan-etaixiang-i) is very easy to understand, so I encourage you to refer to it.
The main features of this template are summarized below:
- Data is managed with PostgreSQL in a Docker container
- Data can be manipulated in both SQL and Python
- A Makefile turns the main workflow steps (feature generation, cross-validation, prediction) into one-line commands
- Experiment settings are managed with config files
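As a minimal sketch of the SQL-plus-Python side of this workflow: feature columns live in a PostgreSQL table, and you pull exactly the columns an experiment needs with a SELECT statement. The table name, column names, and connection URL below are hypothetical examples for illustration, not the actual ones used in the template.

```python
# Minimal sketch: select feature columns from a PostgreSQL feature table.
# All names below (train_features, age, fare, ...) are illustrative only.

def feature_query(table: str, features: list[str]) -> str:
    """Build a SELECT statement that pulls the given feature columns."""
    cols = ", ".join(features)
    return f"SELECT {cols} FROM {table}"

if __name__ == "__main__":
    query = feature_query("train_features", ["age", "fare", "pclass"])
    print(query)  # SELECT age, fare, pclass FROM train_features
    # With the PostgreSQL container running, pandas can read this directly:
    # import pandas as pd
    # df = pd.read_sql(query, "postgresql://user:password@localhost:5432/kaggle")
```

Keeping features as table columns like this means each experiment's config file only has to list the column names it wants.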
By using a database, you can inspect the data in a database viewer. This is really nice: EDA becomes much easier than doing everything in pandas.
In this way, I made frequently performed operations executable with the `make` command. It only saves a little typing, but it makes day-to-day work quite comfortable.
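To illustrate the idea, a Makefile for this kind of workflow might look like the sketch below. The target names and script paths are hypothetical; see the repository's actual Makefile for the real commands.

```makefile
# Hypothetical targets illustrating the idea; not the template's real Makefile.

features:  ## generate feature tables in PostgreSQL
	python src/create_features.py

cv:        ## run cross-validation with the current experiment config
	python src/cross_validate.py --config config/exp1.json

predict:   ## train on the full data and write a submission file
	python src/predict.py --config config/exp1.json
```

Typing `make cv` is then all it takes to reproduce an experiment from its config file.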
The code itself is the main content here, so this article only gives a brief overview. If it looks interesting, please clone the repository and try it out!
Finally, pull requests and forks are welcome! Please feel free to reach out!