[PYTHON] Handle various date formats with pandas

Since it is during the new life support period, I would like to talk about how to use "pandas", especially the date format. There is a reputation that "Python is strong in data analysis", but this is realized by the following "standard" packages.

  • "pandas" for representing and analyzing data
  • "NumPy" for basic numeriacal computation
  • "SciPy" for scientific computation including statistics
  • "StatsModels" for regression and other statistical analysis
  • "matplotlib" for visualization (The above is quoted from "Think Stats".)

pandas is a package for manipulating Series objects (one-dimensional data structures) and DataFrame objects (two-dimensional data structures), and originally supports the functions required to handle economic data such as stock prices. It is a library that is strong against so-called time series data.

However, looking at the time-series data that is actually on the market, there are cases where the date column contains "2016-03-28" in the standard format, but in some cases it is "28-Mar-16". In some cases, it is difficult to handle. In this article, I would like to confirm how to handle various dates.

(The programming environment is Python 2.7.11 + Jupyter + IPython kernel and Python 3.5.1 + Jupyter + IPython kernel, pandas 0.18.0.)

Convert dates when entering as much data as possible

Here, it is assumed that the time series data is read as a CSV (comma separated values) file. In the analysis work of time series data, I think that the time and date will be used as the index of the data structure. In pandas, read_csv () is used as follows.

When you want to input the following CSV file

Date,Open,High,Low,Close
2014/12/31,17702.11914,17713.75977,17450.76953,17450.76953
2014/12/30,17702.11914,17713.75977,17450.76953,17450.76953
2014/12/29,17914.55078,17914.55078,17525.66016,17729.83984
2014/12/26,17778.91016,17843.73047,17769.00977,17818.96094
 . . . 
(Omitted)

If you use the following code,

df1 = pd.read_csv('./pandas_date_ex/example1.csv', index_col='Date', parse_dates='Date')
df1.head()         # for check

The contents input to df1 are as follows. pandas_date_index1.PNG

The point to confirm here is that the leftmost "Date" must be a Timestamp type for handling Date as a numerical value, not a string type. (I don't want to use the "character string" index when performing interpolation and regression analysis in the flow of analysis work.)

>>> type(df1.index[0])
pandas.tslib.Timestamp

You can see that the date is properly converted to the Pandas.Timestamp type as above. This is because the following options specified in read_csv () worked correctly.

--index_col ='Date': Set the'Date' column as the index of DataFrame. --parse_dates ='Date': Scan the'Date' column and convert it to Datetime.

As mentioned above, the standard date data format (ex. '2014-12-31') worked well.

Date format situation that seems a little difficult

Next, handle the following "example2.csv". First, to check the contents, enter the file without any options.

df20 = pd.read_csv('./pandas_date_ex/example2.csv')
df20.head()

pandas_date_index2.PNG

At this point,'Date' is not indexed and the variable type remains a string (str). As you can see, this time it seems to be a little difficult to decipher like '15 -Mar-16'. There are various date formats in the world, including this example, so it seems necessary to consider how to deal with them.

Various cases of date notation

--The case where the order of year, month, and day changes. --When the month and day are 1-digit numbers, they are padded with zeros to make the number of digits 2 digits, or they are displayed as 1 digit. --The year is displayed in 4 digits or the last 2 digits. --Case where the month is written in numbers and letters (abbreviation of 3 letters)

Now, let's try the method used in "example1.csv" earlier in "example2.csv". I will change the way of adding options a little, but this is just changing the column specification from the column name ('Date') to the column number (= 0).

df2 = pd.read_csv('./pandas_date_ex/example2.csv', index_col=0, parse_dates=0)
df2.head()

pandas_date_index3.PNG

I found it difficult to interpret the date, but it worked unexpectedly. The date parser used by pandas.read_csv () looks pretty good.

Actually, I prepared the following code in anticipation of the case where the date conversion does not work. (... Created by referring to the Q & A site / stackoverflow.)

f2 = '%d-%b-%y'
my_parser = lambda date: pd.datetime.strptime(date, f2)
df21 = pd.read_csv('example2.csv', index_col=0, parse_dates=0, 
                    date_parser=my_parser)

You can prepare your own parser and use read_csv () to use it. But this time, the above method worked, so this code didn't come into play.

What is the date format including Japanese?

I don't think there are many, but I would like to find out about date formats including Japanese. Try the same method as before.

df31 = pd.read_csv('./pandas_date_ex/example3.csv', index_col='Date', parse_dates='Date')
df31.head()

pandas_date_index4.PNG

It seems that it can be indexed, but Japanese is included in the Date column.

>>> type(df31.index[0])
str

Again, the'Date' column remained a string type. In other words, the parse_dates option specified in read_csv () is not working. (It seems that it is not a specification that causes an exception and interrupts work.)

In this case, you still need your own parser.

f3 = '%Y year%m month%d day'
my_parser = lambda date: pd.datetime.strptime(date, f3)
df3 = pd.read_csv('./pandas_date_ex/example3.csv', index_col=0, parse_dates=0, 
                    date_parser=my_parser)
df3.head()

pandas_date_index5.PNG

It went well. Finally, let's plot (a part of) this data.

df3[['High', 'Low']].plot(figsize=(8,4), grid=True)

pandas_date_index6.png

There seems to be no problem. (The plot function of pandas is a wrapper of the matplotlib library, so it looks like the figure above.)

As mentioned above, the date parser in pandas read_csv () seems to be able to handle a fairly wide range. We also found that special cases such as dates including Japanese such as "year" and "month" can be handled by preparing your own parser. (As an application, you should be able to convert the year with the Japanese era to the Christian era by writing a parser.)

In addition, it is necessary to specify the reading format in the own parser, but it is said that this conforms to the specification of strftime () in C language. (Reference: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)

For details, please check the document, and I will extract the date format that seems to be used frequently. (Directives have different roles depending on the case.)

Directive meaning Example
%d 0 fill(zero padding)The date in the month expressed in decimal. 01, 02, ..., 31
%b The month name of the locale is displayed in abbreviated form. Jan, Feb, ..., Dec (en_US); Jan, Feb, ..., Dez (de_DE)
%B Displays the month name of the locale. January, February, ...,
%m Month in decimal number filled with 0. 01, 02, ..., 12
%y Year with a century (2 digits) expressed in 0-filled decimal numbers. 00, 01, ..., 99
%Y Year(4 digits)Represents the decimal notation of. 2011, 2012, 2013, ...

I've gotten into a little detail, but pandas is very versatile and is used in many places, not just time-series data. I think that the workflow of preprocessing data using pandas and inputting the formatted data to machine learning and deep learning frameworks is also common.

(Notice about understanding date-related classes)

The following are the most common date-related classes handled by Python. --datetime.datetime class --numpy.datetime64 class --pandas.Timestamp class

With my poor understanding, I wrote this article assuming that these three classes (especially datetime.datetime and pandas.Timestamp) are about the same. But strictly speaking, they may not be the same. (I don't think it will affect the behavior of the code posted in the article, but if there are any mistakes or inaccurate descriptions, I would like to correct the article. If you have any suggestions, please.)

References (website)

Recommended Posts

Handle various date formats with pandas
Handle integer types with missing values in Pandas
Quickly visualize with Pandas
Processing datasets with pandas (1)
Bootstrap sampling with Pandas
Convert 202003 to 2020-03 with pandas
Merge datasets with pandas
Learn Pandas with Cheminformatics
Handle Excel with python
Handle rabbimq with python
Data visualization with pandas
Data manipulation with Pandas!
Shuffle data with pandas
Get date with python
Various colorbars with Matplotlib
Ingenuity to handle data with Pandas in a memory-saving manner
Read csv with python pandas
Load nested json with pandas
Try various things with PhantomJS
[Python] Change dtype with pandas
[Tips] Handle Athena with Python
One-liner addition with various scripts
Manipulate various databases with Python
Standardize by group with pandas
Handle numpy arrays with f2py
Various memorandums developed with Choregraphe
Handle JSON files with Matlab
Prevent omissions with pandas print
Data processing tips with Pandas