[PYTHON] Evaluation order when chaining when() in PySpark

While writing PySpark code that chains when(), I started to wonder:

"Does an earlier when take precedence, just as in SQL?" "Or, because this is a method chain, does the when written last overwrite the earlier ones?"

To settle it, I wrote some verification code and checked the result.

Dummy data

df = spark.createDataFrame([(1,),(2,),(3,)], schema=('val',))
display(df)
val
1
2
3

For Spark SQL

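# Register the DataFrame as a temporary view so it can be queried from Spark SQL
# (createOrReplaceTempView replaces the deprecated registerTempTable)
df.createOrReplaceTempView('tmp')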
SELECT
  val,
  CASE
    WHEN val <= 1 THEN 'label_1'
    WHEN val <= 2 THEN 'label_2'
    ELSE 'label_3'
  END AS label
FROM tmp
val label
1 label_1
2 label_2
3 label_3

In SQL, as expected, the WHEN written earlier takes precedence: the row with val = 1 also satisfies val <= 2, but it gets label_1.

For PySpark

from pyspark.sql import functions as F

df_label = df.withColumn('label',
    F.when(F.col('val') <= 1, 'label_1')
     .when(F.col('val') <= 2, 'label_2')
     .otherwise('label_3')
)
display(df_label)
val label
1 label_1
2 label_2
3 label_3

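The result is the same as in Spark SQL. Even when when() is chained in PySpark, the condition written earlier takes priority: the first matching when wins, and a later when applies only to rows that no earlier condition matched.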
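To make the "first match wins" rule concrete, here is a small plain-Python sketch of the same logic. It is only a model of the behavior observed above, not how PySpark implements it, and the helper name label_for and the list-of-pairs layout are my own assumptions for illustration.

```python
# Plain-Python model of first-match-wins evaluation (hypothetical helper,
# not part of PySpark). Each branch is a (predicate, label) pair.
def label_for(val, branches, default):
    """Return the label of the first branch whose predicate matches val."""
    for predicate, label in branches:
        if predicate(val):
            return label  # later branches are never consulted
    return default  # plays the role of otherwise()

# Same conditions as the article, in the same order.
in_order = [
    (lambda v: v <= 1, 'label_1'),
    (lambda v: v <= 2, 'label_2'),
]
# Same conditions with the order swapped.
swapped = list(reversed(in_order))

labels_in_order = [label_for(v, in_order, 'label_3') for v in (1, 2, 3)]
labels_swapped = [label_for(v, swapped, 'label_3') for v in (1, 2, 3)]

print(labels_in_order)  # ['label_1', 'label_2', 'label_3']
print(labels_swapped)   # ['label_2', 'label_2', 'label_3']
```

With the order swapped, val = 1 is caught by the val <= 2 branch first, so label_1 is never produced. The same applies to the PySpark chain: write the narrowest condition first.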
