We will analyze the number of corona patients in Japan and create a predictive model.

here ・ How to use the library called Prophet provided by Facebook ・ Random forest regression ・ XGBoost I will try three methods.

environment

Use Jupyter Notebook which is installed when you install Anaconda on Windows 10

version information: Jupyter Notebook 6.1.4 Python 3.8.5

Library import

fbprophet, plotly, and xgboost with Anacona Prompt respectively 　conda install -c conda-forge fbprophet 　pip install plotly 　pip install xgboost After installing with, I imported the library on Jupyter Notebook by the following.

import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
pd.pandas.set_option('display.max_columns', None)
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
pd.pandas.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
from fbprophet import Prophet
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.model_selection import ParameterGrid
from tqdm import tqdm
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

Data to use

--nhk_news_covid19_domestic_daily_data.csv: Corona patient number data released by NHK

Data provider: https://www3.nhk.or.jp/news/special/coronavirus/data-all/

Display patient count data as a data frame:

path = "../input/covid19data_from_NHK/nhk_news_covid19_domestic_daily_data.csv"
df_Jap = pd.read_csv(path,encoding = 'utf-8')
df_Jap

Graph display:

fig = px.area(df_Jap, x="date", y="Number of infected people in Japan_Number of announcements per day", height=600, width=700,
             color_discrete_sequence = ["blue"])
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()

You can see that the first wave around April 2020, the second wave around August, and the third wave around December are coming.

Training data and test data

I divide the data into training data and test data, but here Training data: 2020/7/14-2020/11/30 Test data: 2020/12/1-2020/12/31 will do. The reason for setting it from July 14, 2020 is that before that, the number of patients was close to 0 and the characteristics of the data would change.

Here, the "date" column is converted to datetime64 type by the to_datetime method of pandas to make it easier to handle.

df_Jap['Date (datetime64)'] = pd.to_datetime(df_Jap['date'], errors='coerce')
import datetime as dt
Train_Jap=df_Jap[(dt.datetime(2020,7,14) <= df_Jap['Date (datetime64)']) & (df_Jap['Date (datetime64)']  <= dt.datetime(2020,11,30) )]
Test_Jap=df_Jap[(dt.datetime(2020,12,1) <= df_Jap['Date (datetime64)']) & (df_Jap['Date (datetime64)']  <= dt.datetime(2020,12,31) )]

Learning with Prophet

Prophet is a time series forecasting library developed by Facebook, details are described below. https://facebook.github.io/prophet/docs/quick_start.html

Feature: --Prophet input must be a dataframe with columns ds and y --ds is the date stamp (date), y is the numerical measurement data you want to predict

Renamed for use with Prophet:

Train_Jap=Train_Jap[["date","Number of infected people in Japan_Number of announcements per day"]].rename(columns={"date":"ds","Number of infected people in Japan_Number of announcements per day":"y"})
Test_Jap=Test_Jap[["date","Number of infected people in Japan_Number of announcements per day"]].rename(columns={"date":"ds","Number of infected people in Japan_Number of announcements per day":"y"})

Prophet settings

The Prophet model is trained against the training data (Train_Jap) and predicted against the test data (Test_Jap), but the following settings can be made.

growth

You can select growth ='linear' and growth ='logistic' (logistic function). With growth ='logistic', you can set the upper limit ('cap' (carrying capacity)) and lower limit of the data, which is used here. 'cap' is set to 10000 here because of the visibility of the graph. (This is fine as long as the number of infected people per day does not exceed 10,000)

Train_Jap['cap']=10000
Test_Jap['cap']=10000

changepoints

As a trend change point '2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20' To set. (Read visually from the graph)

seasonality_mode

seasonality_mode ='additive' seasonality_mode ='multiplicative' You can choose between the two, but if you look at the graph above, the weekly periodicity increases in proportion to the number of infected people, so the multiplicative seasonality is adopted.

seasonality_prior_scale

A parameter that adjusts the degree to which the periodicity of the model fits the data. First, let's set it to 1 temporarily, but later we will try to enter multiple values by grid search.

changepoint_prior_scale

It represents the variance of the Laplace distribution, which is the prior distribution of the trend term, and the larger it is, the easier it is to detect the change point. If you make this term too large, quite a few points will be detected as change points. First, let's set it to 1 temporarily, but later we will try to enter multiple values by grid search.

Information source: https://www.atmarkit.co.jp/ait/articles/1906/07/news004_2.html

interval_width

Represents the width of the uncertainty interval, which is 0.8 by default. I will try to enter multiple values later in the grid search.

Execution of learning

model=Prophet(growth='logistic', seasonality_mode='multiplicative', seasonality_prior_scale=1, changepoint_prior_scale=1, changepoints=['2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20'])
model.fit(Train_Jap)
forecast = model.predict(Test_Jap)
fig = model.plot_components(forecast)

Trends (movements excluding changes in annual and weekly cycles) and weekly learning results are available ↓ The annual cycle is not displayed because there is no data for one year. I knew that the number of infected people would decrease on Monday due to the small number of tests on Sunday, but on the contrary, it seems that there is a tendency for the number of people to be infected on Thursday, Friday and Saturday to increase:

View training data and forecast results. The change point of the set trend is displayed as a red dashed line, the forecast trend line is displayed as a red line, and the forecast result is displayed as a blue line.

from fbprophet.plot import add_changepoints_to_plot
plot = model.plot(forecast)
a = add_changepoints_to_plot(plot.gca(), model, forecast)
plt.ylim([-0, 5000])

Add the predicted value yhat obtained by training to Test_Jap as a column:

Test_Jap['yhat']=forecast['yhat'].values

Compare the actual number of infected people in December with the predicted results:

plt.figure(figsize=(20, 8))
plt.plot(Test_Jap['ds'], Test_Jap['y'], 'b-', label = 'Actual')
plt.plot(Test_Jap['ds'], Test_Jap['yhat'], 'r--', label = 'Prediction')
plt.xlabel('Date',rotation=90); plt.ylabel('Sales'); plt.title('Actual vs Prediction')
plt.xticks(rotation=90)
plt.legend();

Let's calculate the accuracy:

Test_Jap['diff']=(Test_Jap.y-Test_Jap.yhat).abs()
acc_ts2=(1-(Test_Jap['diff'].sum()/Test_Jap['y'].sum()))*100
acc_ts2

#85.55183564855452

The accuracy this time was about 86%. As an index MAE (Mean Absolute Error) MSE (Mean Squared Error) RMSE (Root Mean Square Error) Let's calculate.

MAE_ts=metrics.mean_absolute_error(Test_Jap['y'], Test_Jap['yhat'])
MSE_ts=metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat'])
RMSE_ts=np.sqrt(metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat']))
print('MAE:', MAE_ts)
print('MSE:', MSE_ts)
print('RMSE:', RMSE_ts)

MAE: 404.44140578238205
MSE: 305342.084316079
RMSE: 552.5776726543328

I don't think it's bad already, but we will tune the parameters by grid search below.

Grid search

I have set the following values for'changepoint_prior_scale','seasonality_prior_scale', and'interval_width'.

params_grid = {'changepoint_prior_scale':[0.1, 0.5, 1, 2, 10],
               'seasonality_prior_scale':[0.1, 0.5, 1, 2, 10],
               'interval_width':[0.8, 0.85, 0.9, 0.95]
              }
grid = ParameterGrid(params_grid)

model_parameters = pd.DataFrame(columns = ['Acc','Parameters'])
for p in tqdm(grid):
    
    Train=Train_Jap.copy()
    Valid=Test_Jap[['ds','y','cap']].reset_index()
            
    m=Prophet(growth='logistic', 
              seasonality_mode='multiplicative', 
              seasonality_prior_scale=p['seasonality_prior_scale'], 
              changepoint_prior_scale=p['changepoint_prior_scale'], 
              changepoints=['2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20'],
              interval_width = p['interval_width']
             )
            
    m.fit(Train_Jap)
            

    forecast = m.predict(Valid[['ds','cap']])
    forecast = forecast.astype({"ds": object,"cap": object})
    Valid=pd.concat([Valid.reset_index()[['ds','y']],forecast['yhat']], axis=1)
    
    #performance metric
    Valid['diff']=(Valid.y-Valid.yhat).abs()
    acc=(1-((Valid['diff'].sum()/Valid['y'].sum())))*100
    
    model_parameters = model_parameters.append({'Acc':acc,'Parameters':p},ignore_index=True)
            
parameters = model_parameters.sort_values(by=['Acc'],ascending=False)
parameters = parameters.reset_index(drop=True)
        
best_parameters=parameters['Parameters'][0]

The best parameter is 'changepoint_prior_scale': 0.5, 'interval_width': 0.8, 'seasonality_prior_scale': 2 have become. Let's try graph drawing and accuracy calculation again using these parameters.

m = Prophet(growth='logistic', 
            seasonality_mode='multiplicative', 
            seasonality_prior_scale=best_parameters['seasonality_prior_scale'],
            changepoint_prior_scale=best_parameters['changepoint_prior_scale'],
            changepoints=['2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20'],
            interval_width =best_parameters['interval_width']
            )
m.fit(Train_Jap)
forecast=m.predict(Test_Jap)

from fbprophet.plot import add_changepoints_to_plot
plot = model.plot(forecast)
a = add_changepoints_to_plot(plot.gca(), model, forecast)
plt.ylim([-0, 5000])

Test_Jap['yhat']=forecast['yhat'].values
plt.figure(figsize=(20, 8))
plt.plot(Test_Jap['ds'], Test_Jap['y'], 'b-', label = 'Actual')
plt.plot(Test_Jap['ds'], Test_Jap['yhat'], 'r--', label = 'Prediction')
plt.xlabel('Date',rotation=90); plt.ylabel('Sales'); plt.title('Actual vs Prediction')
plt.xticks(rotation=90)
plt.legend();

Test_Jap_best['diff']=(Test_Jap_best.y-Test_Jap_best.yhat).abs()
acc_ts2=(1-(Test_Jap_best['diff'].sum()/Test_Jap_best['y'].sum()))*100
acc_ts2

#90.92890097106074

The accuracy was about 91% and the MAE, MSE, and RMSE were as follows:

MAE_ts=metrics.mean_absolute_error(Test_Jap['y'], Test_Jap['yhat'])
MSE_ts=metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat'])
RMSE_ts=np.sqrt(metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat']))
print('MAE:', MAE_ts)
print('MSE:', MSE_ts)
print('RMSE:', RMSE_ts)

#MAE: 253.92347110782634
#MSE: 111763.48089497152
#RMSE: 334.31045585648604

Learning with random forest regression

In using random forest regression, we will use the month, day, and day of the week as explanatory variables (here, variables that are thought to affect the number of patients with corona).

It may be more accurate to use the number of PCR tests and the number of travelers from overseas as explanatory variables here, but when predicting the number of corona patients beyond the present, such data cannot be used, so we will use it here. No.

def create_date_features(df):
    df['Moon']=df['Date (datetime64)'].dt.month
    df['Day']=df['Day付（datetime64）'].dt.day
    df['Day of the week (number)'] = df['Date (datetime64)'].dt.weekday
    return df

reg_Jap=create_date_features(df_Jap)

def getdayofweek(day):
    if (day == 0):
        return "Monday"
    elif(day == 1):
        return "Tuesday"
    elif(day ==2):
        return "Wednesday"
    elif(day == 3):
        return "Thursday"
    elif(day ==4):
        return "Friday"
    elif(day == 5):
        return "Saturday"
    elif(day ==6):
        return "Sunday"

reg_Jap['Day of the week'] = reg_Jap['Day of the week（番号）'] .apply(getdayofweek)
reg_Jap=pd.get_dummies(reg_Jap,columns=['Day of the week'])
reg_Jap

Training data:

Train_reg_Jap=reg_Jap[(dt.datetime(2020,7,14) <= reg_Jap['Date (datetime64)']) & (reg_Jap['Date (datetime64)']  <= dt.datetime(2020,11,30) )]
x_train_reg_Jap=Train_reg_Jap[['Moon','Day','曜Day_Moon曜','曜Day_Tuesday','曜Day_Wednesday','曜Day_Thursday','曜Day_Friday','曜Day_Saturday','曜Day_Day曜']]
y_train_reg_Jap=Train_reg_Jap[['Number of infected people in Japan_Number of announcements per day']]

test data:

Test_reg_Jap=reg_Jap[(dt.datetime(2020,12,1) <= reg_Jap['Date (datetime64)']) & (reg_Jap['Date (datetime64)']  <= dt.datetime(2020,12,31) )]
x_test_reg_Jap=Test_reg_Jap[['Moon','Day','曜Day_Moon曜','曜Day_Tuesday','曜Day_Wednesday','曜Day_Thursday','曜Day_Friday','曜Day_Saturday','曜Day_Day曜']]
y_test_reg_Jap=Test_reg_Jap[['Number of infected people in Japan_Number of announcements per day']]

Perform learning. First, set only n_estimators = 100 as a parameter, and use the default values for the others.

--n_estimators: Number of decision trees

The parameters for random forest regression can be found here. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor ：

rf_Jap = RandomForestRegressor(n_estimators=100)
rf_Jap.fit(x_train_reg_Jap,y_train_reg_Jap)
pred_rf = rf_Jap.predict(x_test_reg_Jap)

Accuracy calculation:

y_test_reg_Jap['diff']=abs((y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].values-pred_rf))
acc_rf=(1-(y_test_reg_Jap['diff'].sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_rf

＃57.238258785021955

It became a low value of about 57%.

Grid search

--max_features: Number of features of decision tree

For “auto”, max_features = n_features, For “sqrt”, max_features = sqrt (n_features), For “log2”, max_features = log2 (n_features) Will be.

--min_samples_split: Minimum number of samples when splitting a tree

--bootstrap: Whether to use bootstrap sampling (a method to allow duplication and retrieve data randomly)

param_grid = { 
        "n_estimators"      : [10,20,300,100,200,500],
        "max_features"      : ["auto", "sqrt", "log2"],
        "min_samples_split" : [2,4,6,8],
        "bootstrap": [True, False],
            }
grid = ParameterGrid(param_grid)

model_parameters = pd.DataFrame(columns = ['Acc','Parameters'])
for p in tqdm(grid):
    
    X_Train=x_train_reg_Jap.copy()
    Y_Train=y_train_reg_Jap.copy()
    X_Valid=x_test_reg_Jap.copy()
    Y_Valid=y_test_reg_Jap.copy()
    m = RandomForestRegressor(n_estimators = p['n_estimators'],
               max_features = p['max_features'],
               min_samples_split = p['min_samples_split'],
               bootstrap=p['bootstrap']
        )
    
    m.fit(X_Train,Y_Train)
    pred_rf2 = m.predict(X_Valid)
            
    Y_Valid['yhat']=pred_rf2
    
    #performance metric
    Y_Valid_diff=(Y_Valid['Number of infected people in Japan_Number of announcements per day']-Y_Valid.yhat).abs()
    acc=(1-((Y_Valid_diff.sum()/Y_Valid['Number of infected people in Japan_Number of announcements per day'].sum())))*100
    
    model_parameters = model_parameters.append({'Acc':acc,'Parameters':p},ignore_index=True)
            
parameters = model_parameters.sort_values(by=['Acc'],ascending=False)
parameters = parameters.reset_index(drop=True)
        
best_parameters=parameters['Parameters'][0]

m2 = RandomForestRegressor(
        bootstrap=best_parameters['bootstrap'],
        max_features=best_parameters['max_features'],
        min_samples_split=best_parameters['min_samples_split'],
        n_estimators=best_parameters['n_estimators']
        )
m2.fit(x_train_reg_Jap,y_train_reg_Jap)
pred_rf2 = m2.predict(x_test_reg_Jap)

y_test_reg_Jap_diff2=abs((y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].values-pred_rf2))
acc_rf2=(1-(y_test_reg_Jap_diff2.sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_rf2

#57.96572037955858

Even if I used the best parameters, it was almost the same as about 57%. In addition, due to the randomness of random forest regression, there were cases where the best parameter was lower in accuracy in the above code.

Learning with XGBoost

XGBoost (extreme gradient boosting) is a type of gradient boosting that uses the error between the measured value and the predicted value when creating a new decision tree using the results of the previous decision tree. Is a feature.

The parameters of XGBoost are described here. https://xgboost.readthedocs.io/en/latest/parameter.html

Also, the explanation on this site was easy to understand. http://kamonohashiperry.com/archives/209

--nthread: Number of parallel threads used for calculation --learning_rate: Learning rate --max_depth: Maximum depth of decision tree --min_child_weight: The minimum total value of data weighting in child nodes --subsample: Percentage of samples randomly extracted in each tree --colsample_bytree: Percentage of columns randomly extracted in each tree

The data uses x_train_reg_Jap, y_train_reg_Jap, x_test_reg_Jap, and y_test_reg_Jap used in random forest regression. (Explanatory variables are month, day and day of the week)

Execution of learning:

xgb_Jap = XGBRegressor(n_estimators=100)
xgb_Jap.fit(x_train_reg_Jap,y_train_reg_Jap)
xgb_pred = xgb_Jap.predict(x_test_reg_Jap)

Accuracy calculation:

y_test_reg_us['diff']=(y_test_reg_us['Number of infected people in Japan_Number of announcements per day']-xgb_pred).abs()
acc_xg=(1-(y_test_reg_us['diff'].sum()/y_test_reg_us['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_xg
#57.22788239483825

Grid search

param_grid = { 
            'nthread':[4], #when use hyperthread, xgboost may become slower,
            'learning_rate': [.03, 0.05, .07], #so called `eta` value
            'max_depth': [5, 6, 7],
            'min_child_weight': [1,4],
            'subsample': [0.7],
            'colsample_bytree': [0.7],
            'n_estimators': [100,200,500]
            }
grid = ParameterGrid(param_grid)

model_parameters = pd.DataFrame(columns = ['Acc','Parameters'])
for p in tqdm(grid):
    
    X_Train=x_train_reg_Jap.copy()
    Y_Train=y_train_reg_Jap.copy()
    X_Valid=x_test_reg_Jap.copy()
    Y_Valid=y_test_reg_Jap.copy()
    m = XGBRegressor(nthread = p['nthread'],
               learning_rate = p['learning_rate'],
               max_depth=p['max_depth'],
               min_child_weight = p['min_child_weight'],
               subsample = p['subsample'],
               colsample_bytree=p['colsample_bytree'],
               n_estimators=p['n_estimators']             )
    
    
    
    m.fit(X_Train,Y_Train)
    pred_xg2 = m.predict(X_Valid)
            
    Y_Valid['yhat']=pred_xg2
    
    #performance metric
    Y_Valid['diff']=(Y_Valid['Number of infected people in Japan_Number of announcements per day']-Y_Valid.yhat).abs()
    acc=(1-((Y_Valid['diff'].sum()/Y_Valid['Number of infected people in Japan_Number of announcements per day'].sum())))*100
    
    model_parameters = model_parameters.append({'Acc':acc,'Parameters':p},ignore_index=True)
            
parameters = model_parameters.sort_values(by=['Acc'],ascending=False)
parameters = parameters.reset_index(drop=True)
        
best_parameters=parameters['Parameters'][0]

m = XGBRegressor(
        nthread=best_parameters['nthread'],
        learning_rate=best_parameters['learning_rate'],
        max_depth=best_parameters['max_depth'],
        min_child_weight=best_parameters['min_child_weight'],
        subsample=best_parameters['subsample'],
        colsample_bytree=best_parameters['colsample_bytree'],
        n_estimators=best_parameters['n_estimators']
        )
m.fit(x_train_reg_Jap,y_train_reg_Jap)
pred_xg2 = m.predict(x_test_reg_Jap)

y_test_reg_Jap['diff']=(y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day']-pred_xg2).abs()
acc_xg2=(1-(y_test_reg_Jap['diff'].sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_xg2

#57.36199327011835

Again, the best parameters are almost the same as the previous values.

Summary

・ Prophet ・ Random forest regression ・ XGBoost In these three methods, a model was trained using data on the number of Japanese coronaviruses infected from July 14th to November 30th in 2020, and the number of infected people from December 1st to December 31st was predicted. The accuracy was about 91%, but it was as low as 57% for random forest regression and XGBoost.

I think it's not unreasonable because I could only use the month, day, and day of the week as explanatory variables.

My impression is that Prophet reads the pattern of the 2D graph and predicts it, and I think that it is more suitable than random forest regression and XGBoost in predicting the number of corona patients like this time.

[PYTHON] Analyzing data on the number of corona patients in Japan

environment

Library import

Data to use

Training data and test data

Learning with Prophet

Prophet settings

Execution of learning

Grid search

Learning with random forest regression

Grid search

Learning with XGBoost

Grid search

Summary