[Python] Predicting the future of Numazu's population with time-series regression analysis using Prophet

What I want to do

・Predict the future population of Numazu City from the past several years of vital statistics, using Python's time-series forecasting library "Facebook Prophet"
・Evaluate the analysis results ⇒ tune ⇒ re-analyze

Environment

・Windows 10 64-bit
・Anaconda3
・Virtual environment: Python 3.7.3
・Libraries: fbprophet 0.6, pandas 1.1.2, matplotlib 3.0.3

Data used

Numazu City "Population and Households" List

・The input file **[dataset.csv]** was built by aggregating the monthly population figures from the yearly files on the above site.
・Numazu City's open data does include a demographic report (Fuji no Kuni Open Data Catalog (Numazu City)), but those are annual statistics. Since I assumed monthly granularity for this time-series analysis, I instead used the Basic Resident Register figures on the city's official website, which are published monthly.
・The price is that the PDFs had to be reformatted into a spreadsheet by hand. It was tedious.
・The file consists of two columns: year-month and household members (total). Since this is a trial run, there are no external variables.
・The training data covers 2013/04 through 2020/08.

dataset.csv

Year-month  Household members (total)
2013-04 205,790
2013-05 205,739
2013-06 205,572
・ ・ ・ ・ ・ ・
2020-07 193,896
2020-08 193,756

Predictive analysis

Library import

import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet

Loading the data

Jinkoudata = pd.read_csv("dataset.csv")
Jinkoudata.columns = ['ds','y']

・Load the population data into the DataFrame [Jinkoudata].
・Rename the columns [Year-month] and [Household members (total)] to [ds] and [y] respectively (following Prophet's naming convention).
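The two lines above can be tried without the real file. Below is a minimal sketch, assuming (hypothetically) that the spreadsheet export kept the comma thousands separators seen in the table above; the real dataset.csv may well contain plain numbers, in which case `thousands=','` is simply harmless.

```python
import io

import pandas as pd

# Hypothetical reconstruction of the head of dataset.csv; the real file's
# exact formatting may differ.
csv_text = """Year month,Household members (total)
2013-04,"205,790"
2013-05,"205,739"
"""

# thousands=',' strips the comma grouping so the column parses as integers.
Jinkoudata = pd.read_csv(io.StringIO(csv_text), thousands=',')
Jinkoudata.columns = ['ds', 'y']  # Prophet requires exactly these column names
print(Jinkoudata)
```

If the rename is skipped, Prophet's `fit` raises an error complaining that the `ds` and `y` columns are missing, so this step is not optional.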

Model generation

model = Prophet(weekly_seasonality=False,daily_seasonality=False)

・Since this analysis uses monthly data, the parameters disable the weekly and daily cycles.
・In addition, the following parameters can be set when creating the instance. Allowed values are shown in []; the value shown in each example is the default.

●growth='linear' [linear/logistic] Specifies whether the trend model is linear or logistic (saturating) growth.

●changepoints=None [list of dates (YYYY-MM-DD)] Manually specifies the locations of potential trend (long-term fluctuation) changepoints. Prophet normally detects changepoints automatically and reflects them in the forecast; use this parameter when you want to supply the candidate changepoint locations yourself instead of relying on automatic detection.

●n_changepoints=25 [integer] Number of candidate changepoints to detect. The larger the value, the more changepoints tend to be found; if it is too large, changepoints may be detected excessively and forecast accuracy may drop.

●yearly_seasonality='auto' [true/false/auto/integer] Enables yearly seasonality. Passing an integer adjusts how strongly it is modeled (the Fourier order; the default is 10).

●weekly_seasonality='auto' [true/false/auto/integer] Enables weekly seasonality. Passing an integer adjusts how strongly it is modeled (the default Fourier order is 3).

●daily_seasonality='auto' [true/false/auto/integer] Enables daily seasonality. Passing an integer adjusts how strongly it is modeled (the default Fourier order is 4).

●holidays=None [DataFrame] DataFrame of event dates. Incorporates holidays and dates with events into the analysis and reflects them in the forecast.

●changepoint_prior_scale=0.05 [float] Scale of the Laplace prior placed on trend changes. The larger the value, the more easily changepoints are detected, but too large a value can make the trend overly flexible. Similar in effect to n_changepoints.

●changepoint_range=0.8 [float] Fraction of the history in which changepoints are detected. With the default 0.8, changepoints are extracted from the first 80% of the training data. If, for example, the trend moves sharply at the end of the data, set this to 1.0 so that changepoints are detected over the entire data range, including that final fluctuation.

There seem to be various other parameters, but with my current knowledge the ones above are all I can meaningfully consider, so I will stop here.

Learning

model.fit(Jinkoudata)

fit()
・Training is performed by passing the training data to fit.

Data frame generation

future = model.make_future_dataframe(periods=16,freq='M')

make_future_dataframe()
・Extends the DataFrame that was fitted to the model with future time-series rows.
・Here, in addition to the fitted period (Jinkoudata: 2013/04–2020/08), 16 further months (2020/09–2021/12) are generated.
・As a result, [future] covers 2013/04 through 2021/12. Note that freq='M' is pandas' month-end frequency, which is why the tail of the frame shows timestamps like 2021-10-31 rather than the first of the month.

future

ds
0 2013-04-01
1 2013-05-01
2 2013-06-01
・ ・ ・ ・ ・ ・
103 2021-10-31
104 2021-11-30
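The month-end dates in the tail of [future] can be reproduced with pandas alone. The sketch below approximates what `make_future_dataframe` does internally (it is an approximation, not fbprophet's exact code); 2020-08-01 is the last training date from the article's data.

```python
import pandas as pd

# Last date present in the training data.
last_history_date = pd.Timestamp("2020-08-01")

# 'M' is pandas' month-end frequency alias, so generated dates land on the
# last day of each month. Generate one extra period, drop anything not
# strictly after the history, and keep 16 rows.
future_dates = pd.date_range(start=last_history_date, periods=17, freq="M")
future_dates = future_dates[future_dates > last_history_date][:16]

print(future_dates[0].date())   # 2020-08-31
print(future_dates[-1].date())  # 2021-11-30
```

This matches the table above: index 104 is 2021-11-30, the month-end standing in for the December 2021 figure.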

Forecast

predict = model.predict(future)

・Pass the time-series DataFrame created above as the argument to predict.
・The forecast covers 2013/04 through 2021/12, i.e. the time series stored in future.

**Prediction result [predict]**

ds yhat yhat_lower yhat_upper
0 2013-04-01 205798.225901 205762.014035 205838.079234
1 2013-05-01 205717.024368 205679.523657 205759.642408
2 2013-06-01 205600.30351 205560.587079 205640.792342
3 2013-07-01 205530.899822 205488.087052 205570.056443
4 2013-08-01 205335.468464 205296.411528 205378.093757

predict holds many other computed columns, but since the raw numbers are hard to take in at once, here we look only at the columns related to the predicted value.
・yhat: predicted value
・yhat_lower: lower bound of the prediction
・yhat_upper: upper bound of the prediction
yhat is the most likely forecast, while yhat_lower and yhat_upper give the lower and upper ends of the predicted value's fluctuation range.
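The relationship between the three columns can be checked directly. Below, a toy DataFrame stands in for `predict`, built from the first two rows of the table above (the values are the article's own, not newly computed output):

```python
import pandas as pd

# Stand-in for the head of `predict`, copied from the table above.
predict_head = pd.DataFrame({
    "ds": pd.to_datetime(["2013-04-01", "2013-05-01"]),
    "yhat": [205798.225901, 205717.024368],
    "yhat_lower": [205762.014035, 205679.523657],
    "yhat_upper": [205838.079234, 205759.642408],
})

# The point forecast always sits inside its uncertainty interval.
inside = predict_head["yhat"].between(predict_head["yhat_lower"],
                                      predict_head["yhat_upper"])
print(inside.all())  # True
```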

Analysis

model.plot(predict)

・plot draws the forecast.
・The black dots are the training data, the blue line is the predicted value (yhat), and the blue band is the prediction interval (yhat_lower to yhat_upper).

model.plot(predict).png
**[What can be read from here]**
■ The forecast shows the population in 2021 continuing to fall at the same pace as before (roughly 1,000 people or fewer per year).
■ It breaks below the 193,000 level by the end of 2021.
■ Since the predicted values do not diverge from the actuals, forecast accuracy is high (or rather, the data is very well behaved).
■ Forecasts up to about half a year ahead can be made with relatively small fluctuation in the predicted value. (Training data runs through 2020/08; the uncertainty band widens rapidly from around 2021/03 onward.)

model.plot_components(predict)

・plot_components draws each component, such as trend fluctuation and seasonal fluctuation, separately.
・The [trend] panel shows the trend (long-term fluctuation) as a transition across years.
・The [yearly] panel shows the seasonal fluctuation within a year.
・The trend shows little increase or decrease over short periods; observe instead the annual-scale rise or fall on which the seasonal fluctuations repeat.

model.plot_components(predict).png
**[What can be read from here]**
■ The population has fallen in a remarkably clean line over the past 8 years. On a yearly basis, there is no sign of a rebound since the start of the training data (2013).
■ Within the year, the decline basically continues, but there is a slight upturn in October and November. (Seasonal fluctuation is analyzed and expressed as a waveform, so with monthly data like this, note that the waveform tends to be exaggerated as the interval between data points grows.) (In the actual data there is almost no increase, but in October and November the rate of decrease shrinks markedly compared with other months, and has even shown slight increases in recent years, so the model appears to render this as an upturn.)
■ The downward trend is, slightly, becoming more gradual. (The drop is sharp every March, but the rate of decrease itself is also shrinking. One can speculate that the March drop comes from employment and school-entry moves, and that the shrinking rate of decrease reflects the declining birthrate; confirming this would require open data on social dynamics such as move-ins and move-outs, and natural dynamics such as births and deaths.)

Evaluation

We need to evaluate how accurate the created model actually is. This time we evaluate with cross-validation.

Cross-validation

from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(model, initial='1095 days', period='182 days', horizon = '365 days')

cross_validation()
・initial: amount of history before the first forecast
・period: spacing between successive forecast start points
・horizon: length of each forecast period
With the declaration above, the first cutoff falls about 1095 days (initial) into the data — if initial is not specified, it defaults to three times the horizon — and the following 365 days (horizon) are forecast as the evaluation period, training only on the data before the cutoff. The cutoff is then shifted by 182 days (period) and another 365 days are forecast as the next evaluation period. This repeats until the end of the data, and every forecast result is stored. These results are converted into an easy-to-read evaluation metric by the method described next.
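The cutoff arithmetic can be sketched with plain pandas. This is a rough reimplementation of how the forecast start points are laid out (the real fbprophet logic differs in detail): cutoffs step backwards from the end of the data by `period`, the last one leaving a full `horizon` for evaluation, and each one must still leave at least `initial` days of training history.

```python
import pandas as pd

# Span of the article's training data.
start, end = pd.Timestamp("2013-04-01"), pd.Timestamp("2020-08-01")
initial = pd.Timedelta("1095 days")
period = pd.Timedelta("182 days")
horizon = pd.Timedelta("365 days")

# Walk backwards from (end - horizon) in steps of `period`, keeping only
# cutoffs that leave at least `initial` days of history for training.
cutoffs = []
cutoff = end - horizon
while cutoff >= start + initial:
    cutoffs.append(cutoff)
    cutoff -= period
cutoffs.sort()

print(cutoffs[0].date(), cutoffs[-1].date(), len(cutoffs))
```

With these inputs the earliest cutoff comes out as 2016-08-05, which matches the `cutoff` column in the df_cv table below.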

**Cross-validation result [df_cv]**

ds yhat yhat_lower yhat_upper y cutoff
0 2016-09-01 199454.428910 199389.987532 199514.165966 199386 2016-08-05
1 2016-10-01 199437.182903 199145.883231 199744.554272 199236 2016-08-05
2 2016-11-01 199274.971653 198645.490967 199906.989114 199110 2016-08-05
3 2016-12-01 199073.586270 198066.379455 200065.689963 199006 2016-08-05
4 2017-01-01 198809.990576 197391.138491 200210.975003 198837 2016-08-05

Evaluation metrics calculation

from fbprophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv)

performance_metrics()
Computes the evaluation metrics for the cross-validation result [df_cv] and stores them.

**Evaluation metrics result [df_p]**

horizon mse rmse mae mape mdape coverage
0 57 days 16078.849437 126.802403 109.436237 0.000554 0.000432 0.1250
1 58 days 20194.085707 142.105896 129.175240 0.000655 0.000627 0.1250
2 59 days 18026.561842 134.263032 113.963954 0.000578 0.000545 0.2500
3 60 days 20932.092431 144.679274 129.291089 0.000657 0.000687 0.2500
4 87 days 33783.385041 183.802571 161.495031 0.000819 0.000824 0.3750

The result table holds one column per evaluation metric; the metrics are as follows.

●MSE [mean squared error] The mean of the squared differences between actual and predicted values.

●RMSE [root mean squared error] The square root of the mean squared difference between actual and predicted values.

●MAE [mean absolute error] The mean of the absolute differences between actual and predicted values.

●MAPE [mean absolute percent error] The mean of the absolute differences between actual and predicted values, expressed as a ratio of the actual values.

MSE, RMSE, and MAE are computed differently, but each expresses the average forecast deviation in the units of the measurements (here, people). MAPE expresses the deviation as a percentage. Since the numbers we are dealing with are large, we will check the evaluation results with MAPE, which is the easiest to interpret.
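The four metrics reduce to a few lines of numpy. The sketch below uses toy actual (`y`) and predicted (`yhat`) values loosely based on the df_cv rows above; the numbers are illustrative only, not real model output.

```python
import numpy as np

# Toy actual and predicted values (illustrative, loosely based on df_cv).
y = np.array([199386.0, 199236.0, 199110.0, 199006.0])
yhat = np.array([199454.4, 199437.2, 199275.0, 199073.6])

err = y - yhat
mse = np.mean(err ** 2)            # mean squared error
rmse = np.sqrt(mse)                # root mean squared error
mae = np.mean(np.abs(err))         # mean absolute error
mape = np.mean(np.abs(err / y))    # mean absolute percent error

print(mse, rmse, mae, mape)
```

Note the scale difference: MSE, RMSE, and MAE come out in (squared) people, while MAPE is a dimensionless ratio well under 1% for errors of this size, which is why it reads more naturally here.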

Visualizing the evaluation metric MAPE (mean absolute percent error)

from fbprophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape')

plot_cross_validation_metric()
Visualizes the metric computed in the previous section. The X-axis is the horizon (evaluation period) specified in cross-validation; the MAPE value of each executed evaluation is drawn as a point, and the average is displayed as a blue line.

plot_cross_validation_metric.png
**[What can be read from here]**
■ Overall forecast accuracy is not bad. ⇒ Even the largest outlier has a MAPE of about 0.4% (for a municipality of roughly 200,000 people, that implies a deviation of up to about 800 people). Following the average, the error moves between roughly 0.05% and 0.15%, i.e. about 100–300 people. A forecast has no single "correct" benchmark, but the actual data often moves by around 100 people per month, so even at its worst the error is a gap that a single aggregation period could close; it can be judged not to be a large error.
■ Accuracy falls as time elapses from the forecast start. ⇒ Naturally, the further out the forecast, the lower the accuracy tends to be (the variance of MAPE increases) as the months go by.

Tuning

Room for improving model accuracy

The model you first create is not necessarily the best one, so accuracy can sometimes be improved by forecasting with a tuned model instead. Here I also record the process of judging whether there is room for tuning.

Tuning perspective

Whether or not to tune must be decided based on the nature of the training data. This time we consider the following perspectives.

●Additive vs. multiplicative seasonality Is the fluctuation in the actual data additive (the size of the seasonal swing stays constant regardless of the level of the target variable) or multiplicative (the swing grows with the level of the target variable)?

●Number of changepoints Is the number of changepoints extracted from the actual data appropriate?

●Changepoint extraction range Is the range over which changepoints are extracted from the actual data appropriate?

Judging whether to tune this model

●Additive vs. multiplicative seasonality
The target variable here is population data. Deaths and births, which are among the drivers of population change, presumably scale with the sizes of the elderly and young populations, so the magnitude of fluctuation should change with the underlying population. **I therefore judge this data to have a slightly multiplicative character.** (The plotted forecast also shows the annual decrease shrinking slightly, if only barely, as the total falls year by year.) (Move-ins and move-outs are likewise affected by the underlying population, but they are also strongly affected by the number and scale of the companies and schools receiving them, so population data alone could not settle the question.) Prophet's [seasonality_mode] parameter accepts "additive" or "multiplicative" seasonality; if omitted, the model is created with additive seasonality by default. As tuning, there is room to try creating a model with multiplicative seasonality. ⇒ Tuning target
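The distinction can be made concrete with a small numpy sketch (hypothetical numbers, not the Numazu data): additive seasonality adds a fixed-size swing onto the trend, while multiplicative seasonality scales the swing with the current level, so the swing shrinks as the population falls.

```python
import numpy as np

# Two years of hypothetical monthly levels with a declining trend.
t = np.arange(24)
trend = 206000.0 - 100.0 * t            # people
season = np.sin(2 * np.pi * t / 12)     # unitless yearly cycle

# Additive: the seasonal swing is a fixed +-500 people at any level.
y_add = trend + 500.0 * season
# Multiplicative: the swing is +-0.25% of the current level, so it shrinks
# as the trend declines.
y_mul = trend * (1.0 + 0.0025 * season)

swing_add_y1 = (y_add - trend)[:12].max()   # year-1 additive amplitude
swing_add_y2 = (y_add - trend)[12:].max()   # year-2 additive amplitude
swing_mul_y1 = (y_mul - trend)[:12].max()   # year-1 multiplicative amplitude
swing_mul_y2 = (y_mul - trend)[12:].max()   # year-2 multiplicative amplitude

print(swing_add_y1 == swing_add_y2)   # additive amplitude is constant
print(swing_mul_y1 > swing_mul_y2)    # multiplicative amplitude shrinks
```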

●Number of changepoints
Prophet automatically extracts changepoints from the training data and uses them in the forecast; the default number of candidates is 25, adjustable at model creation with the [n_changepoints] parameter. If the number is too small, the model fails to recognize points in the series where "a change happens here," and accuracy drops; if it is too large, slight wobbles get recognized as changepoints, the forecast loses flexibility, and accuracy again drops. In other words, to specify a number you need to control which changepoints matter for the model to learn, and with monthly data like this such events tend to be obscured. **Since it is hard to judge how many would be appropriate**, I see little room for tuning here. ⇒ Not a tuning target

●Changepoint extraction range
By default, Prophet extracts changepoints only from the first 80% of the training data. If, for example, the trend rose sharply at the end of the training data, the default model could not make a forecast that accounts for that change, so you would adjust the extraction range by setting the [changepoint_range] parameter to 1.0 at model creation. In this case, **no such large recent fluctuation is present**, so there seems to be no room for tuning from the perspective of the extraction range either. ⇒ Not a tuning target

Tuning: Modeling with multiplicative seasonality

As judged in the section above, we create a model with multiplicative seasonality as our tuning step.

# Model generation (multiplicative seasonality)
# Added the constructor parameter seasonality_mode="multiplicative"
model_multi = Prophet(weekly_seasonality=False, daily_seasonality=False, seasonality_mode="multiplicative")
# Learning
model_multi.fit(Jinkoudata)
# Forecast
predict_multi = model_multi.predict(future)
# Draw the prediction result
fig = model_multi.plot(predict_multi)

The multiplicative model's forecast plots are as follows. model.plot(predict_multi).png model.plot_components(predict_multi).png The tendency looks much like the additive model's. We run cross-validation on this model as well.

# Cross-validation
df_cv_multi = cross_validation(model_multi, initial='1095 days', period='182 days', horizon='365 days')
# Evaluation metrics calculation
df_p_multi = performance_metrics(df_cv_multi)
# Evaluation metric visualization by MAPE
fig_multi = plot_cross_validation_metric(df_cv_multi, metric='mape')

The evaluation metric MAPE came out as follows. plot_cross_validation_metric(df_cv_multi).png Slightly — really only very slightly — the predictions are less dispersed and deviate less than the additive model's.

Let us draw the actual data and the two forecast results in the same figure and compare them.

# Figure size
plt.figure(figsize=(12, 6))
# Axis labels
plt.xlabel('Year month')
plt.ylabel('Jinkou')
# Drawing range (row index / population)
plt.xlim(82, 105)
plt.ylim(190000, 196000)
# Replace index ticks with year-month labels
plt.xticks([82, 88, 94, 100, 105], ["2020-01", "2020-07", "2021-01", "2021-07", "2021-12"])
# Plot each series against its row index
plt.plot(Jinkoudata.y, label='Jinkoudata')
plt.plot(predict.yhat, label='Predict_add')
plt.plot(predict_multi.yhat, label='Predict_multi')
plt.legend()

The result is below. plt.legend_data_add_multi.png [Jinkoudata] is the measured data used for training, [Predict_add] is the additive-seasonality model's forecast, and [Predict_multi] is the multiplicative-seasonality model's forecast. …One may ask whether this is acceptable, but the [Predict_add] and [Predict_multi] lines overlap and the results are almost identical. The multiplicative character of the measured data was slight to begin with, so perhaps it simply does not show up as a meaningful difference in the forecasts.

Summary

Modeling and forecasting themselves were easier than I expected. To actually make use of them, though, you still need to understand the nature of the data and its correlations. The horizon over which the model's forecasts can be trusted was shorter than I expected, so I came to think that operating such a model would presuppose reading the trend from each period's data and re-tuning each time (and this probably is not limited to this example).

I tried forecasting the future population as practice, but since the decline was obvious from the original data, the forecast itself was perhaps not worth that much. Choosing the theme matters after all. To dig deeper, external variables could be used to improve accuracy, but I would want to try a separate correlation analysis first. Numazu is my favorite city, so it might be good to make forecasts again after studying other fields.