This article is for day 19 of the Kaggle Advent Calendar 2019.
Hi, I'm kiccho1101! This is the first article I've ever written. Thanks for reading!
In this post, I'd like to introduce a way of managing features for Kaggle competitions with PostgreSQL, which worked better than I expected when I tried it.
The repository I created is here: https://github.com/kiccho1101/datascience-template. The README contains usage examples based on data from the Titanic competition.
In Kaggle competitions, if you write code without planning ahead (as I used to), you run into problems like the following:
- You forget what each feature represents
- Notebooks fall into chaos (exp1.ipynb, exp1_tmp.ipynb, exp1_tmp_tmp.ipynb, etc. pile up)
- Looking back half a year later, you find code that makes you wonder, "Did I really write this???"
To solve these problems, you need some way of managing your features.
For feature management, [Takanobu Nozawa's slide](https://speakerdeck.com/takapy/detafen-xi-konpenioite-te-zheng-liang-guan-li-nipi-bi-siteiruquan-ren-lei-nichuan-etaixiang-i) is very easy to understand, so I encourage you to refer to it.
The main features of this template are summarized below:
- Data is managed with PostgreSQL in a Docker container
- Data can be manipulated in both SQL and Python
- A Makefile turns the main workflow steps (feature generation, cross-validation, prediction) into one-line commands
- Experiment settings are managed with config files
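As a minimal sketch of the SQL-plus-Python side of this workflow: feature columns live in a PostgreSQL table, and you pull exactly the columns an experiment needs with a SELECT statement. The table name, column names, and connection URL below are hypothetical examples for illustration, not the actual ones used in the template.

```python
# Minimal sketch: select feature columns from a PostgreSQL feature table.
# All names below (train_features, age, fare, ...) are illustrative only.

def feature_query(table: str, features: list[str]) -> str:
    """Build a SELECT statement that pulls the given feature columns."""
    cols = ", ".join(features)
    return f"SELECT {cols} FROM {table}"

if __name__ == "__main__":
    query = feature_query("train_features", ["age", "fare", "pclass"])
    print(query)  # SELECT age, fare, pclass FROM train_features
    # With the PostgreSQL container running, pandas can read this directly:
    # import pandas as pd
    # df = pd.read_sql(query, "postgresql://user:password@localhost:5432/kaggle")
```

Keeping features as table columns like this means each experiment's config file only has to list the column names it wants.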
By using a database, you can inspect the data in a database viewer. This is really nice: EDA becomes much easier than doing everything in pandas.
In this way, I made frequently performed operations executable with the `make` command. It only saves a little typing, but it makes day-to-day work quite comfortable.
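To illustrate the idea, a Makefile for this kind of workflow might look like the sketch below. The target names and script paths are hypothetical; see the repository's actual Makefile for the real commands.

```makefile
# Hypothetical targets illustrating the idea; not the template's real Makefile.

features:  ## generate feature tables in PostgreSQL
	python src/create_features.py

cv:        ## run cross-validation with the current experiment config
	python src/cross_validate.py --config config/exp1.json

predict:   ## train on the full data and write a submission file
	python src/predict.py --config config/exp1.json
```

Typing `make cv` is then all it takes to reproduce an experiment from its config file.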
The code itself is the main content here, so this article only gives a brief overview. If it looks interesting, please clone the repository and try it out!
Finally, pull requests and forks are welcome! Please feel free to reach out!