[Spark] I got caught by the "", null, and [] traps in DataFrame

Is Spark 2 really that scary? A story about DataFrame.

Development environment

The versions are old because this is an article I wrote a while ago.

A trap I hit when reading CSV

input.csv


x,y,z
1,,2

pyspark


>>> df = spark.read.csv("input.csv", header=True)
>>> df
DataFrame[x: string, y: string, z: string]
>>> df.show()
+---+----+---+
|  x|   y|  z|
+---+----+---+
|  1|null|  2|
+---+----+---+

When you load a file like this, the empty field comes back as null rather than "". In other words, once you save to CSV, "" and null become indistinguishable, so be careful. If the data will only be read by another Spark app, you can avoid this by saving it as Parquet or Avro instead of CSV.
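Just as a side note, here is a minimal sketch of that workaround (assuming Spark 2.x with an active spark session; the path out.parquet is only a placeholder): the same data round-trips through Parquet with "" and null kept distinct. If I remember correctly, Spark 2.4+ also added an emptyValue option to the CSV source, which may help depending on your version.

pyspark


>>> df0 = spark.createDataFrame([("1", "", "2")], ["x", "y", "z"])
>>> df0.write.mode("overwrite").parquet("out.parquet")
>>> spark.read.parquet("out.parquet").collect()
[Row(x=u'1', y=u'', z=u'2')]
# y survives the round trip as u'', not None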

Traps I hit with string functions

Using the previous df.

pyspark


>>> import pyspark.sql.functions
>>> df.select(pyspark.sql.functions.length("y")).show()
+---------+
|length(y)|
+---------+
|     null|
+---------+
# Understood.
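
# (a hedged aside: if you would rather get 0 than null here, coalesce works)
>>> df.select(pyspark.sql.functions.coalesce(
...     pyspark.sql.functions.length("y"),
...     pyspark.sql.functions.lit(0)).alias("len_y")).show()
+-----+
|len_y|
+-----+
|    0|
+-----+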

>>> df.select(pyspark.sql.functions.split("y", " ")).show()
+-----------+
|split(y,  )|
+-----------+
|       null|
+-----------+
# Makes sense.

>>> df.select(pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))).show()
+-----------------+
|size(split(y,  ))|
+-----------------+
|               -1|
+-----------------+
# -1? Hmm...
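
# (an aside, as far as I know: this -1 is legacy behavior; Spark 2.4 added a
#  spark.sql.legacy.sizeOfNull setting, and Spark 3 returns null here by
#  default. Being explicit about the null case avoids depending on it:)
>>> df.select(pyspark.sql.functions.when(
...     pyspark.sql.functions.col("y").isNull(),
...     pyspark.sql.functions.lit(None)
... ).otherwise(
...     pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))
... ).alias("safe_size")).show()
+---------+
|safe_size|
+---------+
|     null|
+---------+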

>>> df.fillna("").show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|   |  2|
+---+---+---+
#null""Replaced with.

>>> df.fillna("").select(pyspark.sql.functions.length("y")).show()
+---------+
|length(y)|
+---------+
|        0|
+---------+
# ""That's right.

>>> df.fillna("").select(pyspark.sql.functions.split("y", " ")).show()
+-----------+
|split(y,  )|
+-----------+
|         []|
+-----------+
# Yeah, that figures.

>>> df.fillna("").select(pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))).show()
+-----------------+
|size(split(y,  ))|
+-----------------+
|                1|
+-----------------+
# It's not 0??

>>> df2 = spark.createDataFrame([[[]]], "arr: array<string>")
>>> df2
DataFrame[arr: array<string>]
>>> df2.show()
+---+
|arr|
+---+
| []|
+---+

>>> df2.select(pyspark.sql.functions.size("arr")).show()
+---------+
|size(arr)|
+---------+
|        0|
+---------+
# Why is that 1 while this is 0...

>>> df.fillna("").select(pyspark.sql.functions.split("y", " ")).collect()
[Row(split(y,  )=[u''])]
# Oh... right, even in plain Python, len("".split(" ")) == 1
# So I just walked into it myself... orz

In the output of show(), you can't tell the empty array [] apart from the array [""] that contains a single empty string... Surprisingly, none of this behavior is spelled out in the documentation. It really threw me off.
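As a closing sketch under the same assumptions, you can at least make the difference visible by putting size() next to the array in show():

pyspark


>>> df3 = spark.createDataFrame([([],), ([""],)], "arr: array<string>")
>>> df3.select("arr", pyspark.sql.functions.size("arr").alias("n")).show()
+---+---+
|arr|  n|
+---+---+
| []|  0|
| []|  1|
+---+---+
# both rows render as [] in show(), but size() tells them apart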
