While writing PySpark code that chains `when` calls, I started wondering:

"Does an earlier `when` take precedence over the ones written after it, just like in SQL's `CASE WHEN`?"

"Or, since it's a method chain, does the `when` written last overwrite the earlier ones?"

To settle it, I wrote a small verification script and checked.
```python
df = spark.createDataFrame([(1,), (2,), (3,)], schema=('val',))
display(df)
```
| val |
|---|
| 1 |
| 2 |
| 3 |
```python
# Register as a temporary view so it can be queried from Spark SQL
# (registerTempTable is deprecated since Spark 2.0;
#  createOrReplaceTempView is the modern equivalent)
df.registerTempTable('tmp')
```
```sql
SELECT
    val,
    CASE
        WHEN val <= 1 THEN 'label_1'
        WHEN val <= 2 THEN 'label_2'
        ELSE 'label_3'
    END AS label
FROM tmp
```
| val | label |
|---|---|
| 1 | label_1 |
| 2 | label_2 |
| 3 | label_3 |
In SQL, as expected, the `WHEN` condition written first takes precedence: `val = 1` satisfies both `val <= 1` and `val <= 2`, but it gets `label_1`.
```python
from pyspark.sql import functions as F

df_label = df.withColumn('label',
    F.when(F.col('val') <= 1, 'label_1')
     .when(F.col('val') <= 2, 'label_2')
     .otherwise('label_3')
)
```
```python
display(df_label)
```
| val | label |
|---|---|
| 1 | label_1 |
| 2 | label_2 |
| 3 | label_3 |
Even when `when` is chained in PySpark, the condition written first takes precedence, just as in Spark SQL. The chain is not overwritten by later `when` calls.
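This first-match behavior follows the same short-circuit logic as a Python `if`/`elif` chain, which may make it easier to remember. A minimal pure-Python sketch of the same labeling rule (the function name `label` is just for illustration, not part of any API):

```python
def label(val):
    # Mirrors CASE WHEN / chained F.when(): conditions are tested in order,
    # and the first matching branch wins, even if later ones also match.
    if val <= 1:
        return 'label_1'   # val=1 also satisfies val <= 2, but this branch wins
    elif val <= 2:
        return 'label_2'
    else:
        return 'label_3'

print([label(v) for v in (1, 2, 3)])  # ['label_1', 'label_2', 'label_3']
```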