[PYTHON] About time series data and overfitting

Lately I have been preoccupied with how to prevent LightGBM (LGBM) from overfitting.

Along the way, I realized something about how to split time series data into train data and valid data.

Until now, I thought a random split was better even for time series data. Put simply, if you split at a single date-time threshold, the model may be trained on spring, summer, and autumn data without ever seeing winter, so it may end up incomplete.

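However, it turned out that a random split has a problem of its own. It depends on the granularity of the datetime column, but the train data can, for example, contain the row from the minute immediately before a valid row. Rows that close together tend to be nearly identical, so the validation score looks better than it should and it becomes extremely easy to overfit.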
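To make the problem concrete, here is a minimal sketch that measures how close each valid row is to its nearest train row under both split styles. The per-minute data, the 20% validation ratio, and the helper name `nearest_train_gap` are assumptions made up for illustration; they are not from the original memo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one row per minute for 10 days.
n_rows = 10 * 24 * 60
minutes = np.arange(n_rows)

def nearest_train_gap(train_t, valid_t):
    """For each valid timestamp, distance in minutes to the closest train timestamp."""
    train_sorted = np.sort(train_t)
    pos = np.searchsorted(train_sorted, valid_t)
    last = len(train_sorted) - 1
    left = train_sorted[np.clip(pos - 1, 0, last)]
    right = train_sorted[np.clip(pos, 0, last)]
    return np.minimum(np.abs(valid_t - left), np.abs(right - valid_t))

# Random split: about 20% of rows are picked at random as valid data.
is_valid = rng.random(n_rows) < 0.2
random_gap = nearest_train_gap(minutes[~is_valid], minutes[is_valid])

# Threshold split: the last 20% of rows are valid data.
cut = int(n_rows * 0.8)
threshold_gap = nearest_train_gap(minutes[:cut], minutes[cut:])

print("random split, median gap (minutes):", np.median(random_gap))
print("threshold split, median gap (minutes):", np.median(threshold_gap))
```

With a random split, almost every valid row has a train row right next to it, while a threshold split keeps most valid rows far from anything the model has seen.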

My current best practice is to divide the year into four parts (spring, summer, autumn, and winter) and build four models, each using a different season as valid data. The final prediction is the average of the predicted values from the four models.
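The seasonal scheme above could be sketched roughly as follows. This is only an illustration under several assumptions: the month-to-season mapping (meteorological, Northern Hemisphere), the synthetic daily data, and the least-squares stand-in model are all made up here. In practice the `fit_linear` / `predict_linear` pair would be replaced by the LGBM training and prediction code.

```python
import numpy as np

# Assumed season boundaries; adjust them to suit the data.
SEASON_OF_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}

def season_folds(months):
    """Yield (season, train_idx, valid_idx), holding out one season at a time."""
    seasons = np.array([SEASON_OF_MONTH[int(m)] for m in months])
    for season in ("spring", "summer", "autumn", "winter"):
        valid = seasons == season
        yield season, np.flatnonzero(~valid), np.flatnonzero(valid)

def fit_linear(X, y):
    """Stand-in model (NOT LightGBM): ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Hypothetical data: one row per day for one year, three features.
rng = np.random.default_rng(0)
days = np.arange("2020-01-01", "2021-01-01", dtype="datetime64[D]")
months = days.astype("datetime64[M]").astype(int) % 12 + 1
X = rng.normal(size=(len(days), 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=len(days))
X_test = rng.normal(size=(5, 3))

# Train four models, each validated on a different held-out season.
models = {}
for season, train_idx, valid_idx in season_folds(months):
    coef = fit_linear(X[train_idx], y[train_idx])
    error = predict_linear(coef, X[valid_idx]) - y[valid_idx]
    models[season] = coef
    print(f"valid={season}: rmse={np.sqrt(np.mean(error ** 2)):.3f}")

# Final prediction: average of the four models' predictions.
y_pred = np.mean([predict_linear(c, X_test) for c in models.values()], axis=0)
```

Because each valid block is a whole season, no valid row sits only a minute away from a train row, except at the season boundaries.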

====

I wrote the memo above about two weeks ago. The following article describes exactly the same idea, so I am sharing it!

http://tmitani-tky.hatenablog.com/entry/2018/12/19/001304

It seems that scikit-learn also provides a splitter for the kind of validation I want.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
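For reference, a minimal sketch of `TimeSeriesSplit` on a toy array of 12 rows that are assumed to be already sorted by time (the data and `n_splits=3` are arbitrary choices for illustration). Note that it is an expanding-window splitter: each fold trains only on rows earlier than its valid rows, which is not the same as the seasonal hold-out described above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 rows, assumed to be in chronological order.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, valid_idx) in enumerate(splits):
    # Every train index comes before every valid index in the same fold.
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

The resulting index pairs can be used to slice the train data and valid data for each fold.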
