[PYTHON] Analyze Kaggle corona patient count data

Pelin Ay's Kaggle notebook builds a model that predicts the number of corona patients in the United States. Using it as a reference, we will build a model that predicts the number of corona patients in Japan. https://www.kaggle.com/pelinay/covid19-forecasting/

The notebook was written for the following Kaggle competition (which ended on 5/12/2020). https://www.kaggle.com/c/covid19-global-forecasting-week-5

Environment: Jupyter Notebook, installed with Anaconda on Windows

Data to use

--train.csv: Training data

--test.csv: Test data

--submission.csv: Format file for submission of competition

Data provider: https://www.kaggle.com/c/covid19-global-forecasting-week-5/data

Countries_usefulFeatures.csv contains feature information for each country; the entry for Japan is shown below.

image.png

Data provider: https://www.kaggle.com/ishivinal/covid19-useful-features-by-country

I created a kaggle folder in the root folder of Jupyter, and put the data in the input folder and the notebook file in the code folder. The data is read in Jupyter Notebook as follows.

path = "../input/covid19-global-forecasting-week-5/train.csv"
path2 = "../input/covid19-global-forecasting-week-5/test.csv"
path3="../input/covid19-useful-features-by-country/Countries_usefulFeatures.csv"
path4="../input/covid19-global-forecasting-week-5/submission.csv"

df_train = pd.read_csv(path,encoding = 'unicode_escape')
df_test = pd.read_csv(path2,encoding = 'unicode_escape')
df_count_feat=pd.read_csv(path3,encoding = 'unicode_escape')
df_sub=pd.read_csv(path4,encoding = 'unicode_escape')

Library import

I installed fbprophet, plotly, and xgboost from the Anaconda Prompt with the following commands:

conda install -c conda-forge fbprophet
pip install plotly
pip install xgboost

After installing them, I imported the libraries in Jupyter Notebook as follows.

import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_columns', None)
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt
from fbprophet import Prophet
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.model_selection import ParameterGrid
from tqdm import tqdm
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

Prophet

Prophet is a time series forecasting library developed by Facebook. Details are described at the following link. https://facebook.github.io/prophet/docs/quick_start.html

Features:

--Prophet input must be a dataframe with the columns ds and y

--ds is the datestamp (date) column, and y is the numerical measurement you want to predict
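As a minimal sketch of that input format, the snippet below builds a tiny frame by hand and reshapes it into the ds / y layout. The dates and counts are made-up placeholder values, not rows from train.csv, and the variable names raw and prophet_input are only for illustration.

```python
import pandas as pd

# Made-up placeholder rows (not taken from train.csv), using the competition's column names
raw = pd.DataFrame({
    "Date": ["2020-04-01", "2020-04-02", "2020-04-03"],
    "TargetValue": [10, 12, 15],
})

# Prophet expects the date column to be called "ds" and the value column "y"
prophet_input = raw.rename(columns={"Date": "ds", "TargetValue": "y"})

# Converting to datetime makes the column type explicit
prophet_input["ds"] = pd.to_datetime(prophet_input["ds"])

print(prophet_input.dtypes)
```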

Data analysis

In Ay's In [15], graphs for three countries (US, Brazil, and Russia) are displayed; since we are analyzing Japan here, the code was changed as follows.

In [16] (modified)


# Keep only the rows for Japan with the ConfirmedCases target
df_Jap=df_train[df_train['Country_Region']=='Japan']
df_Jap=df_Jap[df_Jap['Target']=='ConfirmedCases']
# Rename TargetValue and keep only the two columns used for plotting
df_Jap=df_Jap.rename(columns={"TargetValue": "Japan_TotalCase"})
df_Jap=df_Jap[["Date","Japan_TotalCase"]]
# df_Jap already holds the renamed column, so the plotting frame is a plain copy
df_plot=df_Jap.copy()

image.png

(Article is being created)

# Combine both conditions into one mask instead of chaining two boolean filters
df_Jap=df_train[(df_train['Country_Region']=='Japan') & (df_train['Target']=='ConfirmedCases')]

plt.figure(figsize=(20, 10))
sns.lineplot(data=df_Jap[df_Jap['Date']<"2020-05-01"], x="Date", y="TargetValue")
plt.xticks(rotation=90);

image.png

Use the data before 2020/5/12 as the training data and the data from 2020/5/12 onward as the test data:

Train_Jap=df_Jap[df_Jap["Date"]<"2020-05-12"]
Test_Jap=df_Jap[df_Jap["Date"]>="2020-05-12"]
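The string comparison above works because the dates are written in the sortable YYYY-MM-DD form. As a hedged alternative, the same split can be sketched with real datetime values; the four rows below are made-up placeholders, and the names toy, cutoff, train_part, and test_part are hypothetical.

```python
import pandas as pd

# Made-up placeholder rows around the cutoff date
toy = pd.DataFrame({
    "Date": ["2020-05-10", "2020-05-11", "2020-05-12", "2020-05-13"],
    "TargetValue": [1, 2, 3, 4],
})

# Parse the strings so the comparison is done on timestamps, not on text
toy["Date"] = pd.to_datetime(toy["Date"])
cutoff = pd.Timestamp("2020-05-12")

train_part = toy[toy["Date"] < cutoff]   # strictly before the cutoff
test_part = toy[toy["Date"] >= cutoff]   # the cutoff day and later

print(len(train_part), len(test_part))
```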

Rename the "Date" column to "ds" and the "TargetValue" column to "y" to use the data in Prophet:

Train_Jap=Train_Jap[["Date","TargetValue"]].rename(columns={"Date":"ds","TargetValue":"y"})
Test_Jap=Test_Jap[["Date","TargetValue"]].rename(columns={"Date":"ds","TargetValue":"y"})

Prophet can be trained with either growth='linear' (linear trend) or growth='logistic' (logistic function). Both patterns are described below.

Prophet: When growth ='linear'

model=Prophet(growth='linear',changepoint_prior_scale=60)
model.fit(Train_Jap)
forecast = model.predict(Test_Jap)
fig = model.plot_components(forecast)

Prophet's plot_components method breaks the forecast down into trend (the trend with the weekly and yearly cycles removed), weekly (weekly cycle), and yearly (yearly cycle). Here, yearly is not displayed because the training data covers only a few months, which is too short to fit a yearly cycle.

image.png

Display of forecast results:

plot = model.plot(forecast)

image.png

When the Japanese data up to 5/12 is fitted with growth='linear', the predicted number of infected people becomes negative. I found that growth='logistic' is needed to keep the forecast from dropping below 0, so the next section uses growth='logistic'.

Prophet: When growth ='logistic'

growth='logistic' requires a 'cap' (carrying capacity) column in the data. The carrying capacity is the upper limit of y; here we enter 120 million, the approximate population of Japan. The minimum value of y can be specified in a 'floor' column. The default appears to be 0, but here 0 is entered explicitly.

Train_Jap['cap']=120000000
Test_Jap['cap']=120000000
Train_Jap['floor']=0
Test_Jap['floor']=0
model=Prophet(growth='logistic',changepoint_prior_scale=60)
model.fit(Train_Jap)
forecast = model.predict(Test_Jap)
fig = model.plot_components(forecast)
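To picture what the cap and floor columns do, here is a minimal sketch of a saturating logistic curve in plain Python. It is a simplified illustration, not Prophet's internal implementation (which also handles changepoints); the function name logistic_trend and the values of k and m are hypothetical.

```python
import math

def logistic_trend(t, cap, floor=0.0, k=0.1, m=50.0):
    """Simplified saturating curve bounded by floor and cap.

    k (growth rate) and m (midpoint) are hypothetical illustration values.
    """
    return floor + (cap - floor) / (1.0 + math.exp(-k * (t - m)))

cap = 120_000_000  # same carrying capacity as used in the article
steps = list(range(0, 201, 50))
values = [logistic_trend(t, cap) for t in steps]

# The curve rises monotonically and never leaves the [floor, cap] band
for t, v in zip(steps, values):
    print(t, round(v))
```

Whatever values k and m take, the curve stays between floor and cap, which is why the logistic trend itself cannot go negative when floor is 0.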

image.png

When the forecast is plotted as is, the y-axis is stretched to fit the value of cap, so limit the range with ylim:

plot = model.plot(forecast)
plt.ylim([-100, 1300])

image.png

The result looks better than with linear growth, but the forecast still drops below 0 in some places. The floor keeps the trend at 0 or above, but the weekly component is added on top of the trend and can still push the forecast below 0. I will update this article if I find a way to improve it.
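The effect can be reproduced with a toy table. This is only a sketch under assumptions: the numbers are made up (not actual Prophet output), the components are assumed to be combined additively (Prophet's default seasonality mode), and clipping at 0 is a simple post-processing idea of mine, not a Prophet feature.

```python
import pandas as pd

# Made-up numbers for illustration only (not actual Prophet output)
toy = pd.DataFrame({
    "trend":  [5.0, 3.0, 1.0, 0.5],    # the trend never goes below 0
    "weekly": [2.0, -4.0, -3.0, 1.0],  # the weekly effect can be negative
})

# With additive seasonality the point forecast is the sum of the components
toy["yhat"] = toy["trend"] + toy["weekly"]

# Post-processing assumption: negative patient counts are replaced with 0
toy["yhat_clipped"] = toy["yhat"].clip(lower=0)

print(toy)
```

With a real forecast the same clip call could be applied to the yhat column, although that only hides the negative values rather than improving the model.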

This time I used slightly old data for study purposes; I plan to run the analysis again with the latest data on the number of corona patients.
