[PYTHON] Precautions when operating with string for TmeStampType of PySpark

Introduction

It is said that there is a pattern that behaves unintentionally if the date format string type is used carelessly when calculating TimeStampType of PySpark.

Therefore, it is possible to calculate with string for TimeStampType, but it is safer to use datetime.

Example

A specific example will be described.

The example shown here is from PySPark 2.4.4.

Verification data

With the following code, create a Spark DataFrame with date data from January 1, 2000 to January 5, 2000, and perform conditional processing on this date data.

Verification data


import pandas as pd
from pyspark.sql import functions as F

pdf = pd.DataFrame(pd.date_range(start='1/1/2000', periods=5), columns=['date'])
sdf = spark.createDataFrame(pdf, ['timestamp'])

TimeStampType and datetime operations

The operation using datetime for TimeStampType works normally.

TimeStampType and datetime operations


target_datetime = datetime.strptime('2000-01-03', '%Y-%m-%d')

print('== datetime(2000-01-03)')
sdf.where(F.col('timestamp') == datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('>  datetime(2000-01-03)')
sdf.where(F.col('timestamp') >  datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('>= datetime(2000-01-03)')
sdf.where(F.col('timestamp') >= datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('<  datetime(2000-01-03)')
sdf.where(F.col('timestamp') <  datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('<= datetime(2000-01-03)')
sdf.where(F.col('timestamp') <= datetime.strptime('2000-01-03', '%Y-%m-%d')).show()
print('between datetime(2000-01-02) and datetime(2000-01-04)')
sdf.where(F.col('timestamp').between(datetime.strptime('2000-01-02', '%Y-%m-%d'), datetime.strptime('2000-01-04', '%Y-%m-%d'))).show()

Output result


== datetime(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
+-------------------+

>  datetime(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+

>= datetime(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+

<  datetime(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+

<= datetime(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
+-------------------+

between datetime(2000-01-02) and datetime(2000-01-04)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+

Operation of TimeStampType and string (datetime format)

Next, the result when string is given in datetime format (yyyy-mm-dd hh: mm: ss) is shown. The string seems to be cast implicitly, and the operation can be performed without any problem.

TimeStampType and string(datetime format)Calculation


print('== string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') == '2000-01-03 00:00:00').show()
print('>  string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') >  '2000-01-03 00:00:00').show()
print('>= string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') >= '2000-01-03 00:00:00').show()
print('<  string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') <  '2000-01-03 00:00:00').show()
print('<= string(2000-01-03 00:00:00)')
sdf.where(F.col('timestamp') <= '2000-01-03 00:00:00').show()
print('between string(2000-01-02 00:00:00) and string(2000-01-04 00:00:00)')
sdf.where(F.col('timestamp').between('2000-01-02 00:00:00', '2000-01-04 00:00:00')).show()

Output result


== string(2000-01-03 00:00:00)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
+-------------------+

>  string(2000-01-03 00:00:00)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+

>= string(2000-01-03 00:00:00)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+

<  string(2000-01-03 00:00:00)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+

<= string(2000-01-03 00:00:00)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
+-------------------+

between string(2000-01-02 00:00:00) and string(2000-01-04 00:00:00)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+

Operation of TimeStampType and string (date format)

Finally, the result when string is given in date format (yyyy-mm-dd) is shown. In this case, there are patterns that result in results that are not intuitively intended.

TimeStampType and string(date format)Calculation


print('== string(2000-01-03)')
sdf.where(F.col('timestamp') == '2000-01-03').show()
print('>  string(2000-01-03)')  #Unintended pattern
sdf.where(F.col('timestamp') >  '2000-01-03').show()
print('>= string(2000-01-03)')
sdf.where(F.col('timestamp') >= '2000-01-03').show()
print('<  string(2000-01-03)')
sdf.where(F.col('timestamp') <  '2000-01-03').show()
print('<= string(2000-01-03)')  #Unintended pattern
sdf.where(F.col('timestamp') <= '2000-01-03').show()
print('between string(2000-01-02) and string(2000-01-04)')  #Unintended pattern
sdf.where(F.col('timestamp').between('2000-01-02', '2000-01-04')).show()

Output result


== string(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
+-------------------+

>  string(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+

>= string(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
|2000-01-05 00:00:00|
+-------------------+

<  string(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+

<= string(2000-01-03)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
+-------------------+

between string(2000-01-02) and string(2000-01-04)
+-------------------+
|          timestamp|
+-------------------+
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
+-------------------+

About behavior

As a result of implicitly casting string to TimeStamp at the time of operation, TimeStamp (2000-01-01 00:00:00) and TimeStamp (string (2000-01-01 00:00:00)) are equivalent. And the result is like TimeStamp (2000-01-01 00:00:00) <TimeStamp (string (2000-01-01)).

From this, it can be imagined that the values for hours (00: 00: 00 in the above example) are not processed properly. (For strict specifications, you need to check the Scala source for the details of the operation)

(Extra edition) Cast from String Type to Time Stamp Type

By the way, when casting from StringType to TimeStampType, it seems that the cast is correct. An example is not shown here, but as with the above pattern, even if arithmetic processing with TimeStampType is performed, it operates normally (naturally because it is between TimeStampType types).

Cast from StringType to TimeStampType


df = spark.createDataFrame([('2000',), ('2000-01-01',), ('2000-01-01 00:00:00',) ], ['str'])
df = df.withColumn('timestamp', F.col('str').cast('timestamp'))
df.show()

Output result


+-------------------+-------------------+
|                str|          timestamp|
+-------------------+-------------------+
|               2000|2000-01-01 00:00:00|
|         2000-01-01|2000-01-01 00:00:00|
|2000-01-01 00:00:00|2000-01-01 00:00:00|
+-------------------+-------------------+

Recommended Posts

Precautions when operating with string for TmeStampType of PySpark
Summary of tools for operating Windows GUI with Python
Precautions when installing tensorflow with anaconda
Be careful of LANG for UnicodeEncodeError when printing Japanese with Python 3
Transactions when operating PostgreSQL with Psycopg2
Aim for content similarity with Pyspark
4th night of loop with for
Precautions when using six with Python 2.5
Precautions when solving DP problems with Python
Precautions when using for statements in pandas
Summary of snippets when developing with Go
The third night of the loop with for
Memorandum of Understanding when migrating with GORM
The second night of the loop with for
A memorandum of method often used when analyzing data with pandas (for beginners)
Precautions when dealing with control structures in Python 2.6
Precautions when using tf.keras.layers.TimeDistributed for tf.keras custom layer
Precautions when using google-cloud library with GAE / py
[Web development with Python] Precautions when saving cookies