Is Spark 2 really scary? A story about DataFrame.
The Spark version used here is old because I wrote this article a while ago.
input.csv
x,y,z
1,,2
pyspark
>>> df = spark.read.csv("input.csv", header=True)
>>> df
DataFrame[x: string, y: string, z: string]
>>> df.show()
+---+----+---+
|  x|   y|  z|
+---+----+---+
|  1|null|  2|
+---+----+---+
When you load a file like this, the empty field becomes null instead of "". In other words, once data is saved as CSV, "" and null are indistinguishable. Please be careful. If the data will only be read by another Spark app, you can avoid the problem by saving it as Parquet or Avro instead of CSV.
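The ambiguity is already present in the CSV text itself, not only in Spark's reader. As a minimal sketch, using Python's standard `csv` module rather than Spark (the helper name `to_csv_line` is made up for this illustration), a missing value and an empty string serialize to the same line:

```python
import csv
import io


def to_csv_line(row):
    """Serialize one row with the standard csv writer and return the text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


# One row with a missing value (None), one with an empty string.
with_none = to_csv_line([1, None, 2])
with_empty = to_csv_line([1, "", 2])

print(repr(with_none))
print(repr(with_empty))
print(with_none == with_empty)

# Reading the text back cannot recover which of the two it originally was.
parsed = next(csv.reader(io.StringIO(with_none)))
print(parsed)
```

Both rows are written as `1,,2`, so any reader has to pick one interpretation for the empty field. Spark picks null here; Python's `csv.reader` picks the empty string.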
Using the df from above:
pyspark
>>> df.select(pyspark.sql.functions.length("y")).show()
+---------+
|length(y)|
+---------+
|     null|
+---------+
# Understood.
>>> df.select(pyspark.sql.functions.split("y", " ")).show()
+-----------+
|split(y,  )|
+-----------+
|       null|
+-----------+
# OK, that makes sense.
>>> df.select(pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))).show()
+-----------------+
|size(split(y,  ))|
+-----------------+
|               -1|
+-----------------+
# -1? Well...
>>> df.fillna("").show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|   |  2|
+---+---+---+
# null was replaced with "".
>>> df.fillna("").select(pyspark.sql.functions.length("y")).show()
+---------+
|length(y)|
+---------+
|        0|
+---------+
# It's "", so that's right.
>>> df.fillna("").select(pyspark.sql.functions.split("y", " ")).show()
+-----------+
|split(y,  )|
+-----------+
|         []|
+-----------+
# True enough.
>>> df.fillna("").select(pyspark.sql.functions.size(pyspark.sql.functions.split("y", " "))).show()
+-----------------+
|size(split(y,  ))|
+-----------------+
|                1|
+-----------------+
# It's not 0??
>>> df2 = spark.createDataFrame([[[]]], "arr: array<string>")
>>> df2
DataFrame[arr: array<string>]
>>> df2.show()
+---+
|arr|
+---+
| []|
+---+
>>> df2.select(pyspark.sql.functions.size("arr")).show()
+---------+
|size(arr)|
+---------+
|        0|
+---------+
# Why is that one 1 and this one 0...
>>> df.fillna("").select(pyspark.sql.functions.split("y", " ")).collect()
[Row(split(y,  )=[u''])]
# Oh... Sure enough, even in plain Python, len("".split(" ")) == 1
# Does that mean I simply fell into a trap...? orz
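The plain-Python behavior mentioned in the last comment can be checked without Spark. The sketch below also shows one possible guard; `split_or_empty` is a hypothetical helper for illustration, not something Spark or the article provides:

```python
# With an explicit separator, str.split never returns an empty list:
# an empty input yields a list holding one empty string.
explicit = "".split(" ")
print(explicit, len(explicit))

# With no separator argument, whitespace runs are collapsed and
# empty strings are dropped, so an empty input yields an empty list.
implicit = "".split()
print(implicit, len(implicit))


def split_or_empty(text, sep=" "):
    """Hypothetical helper: treat None and "" as 'no tokens at all'."""
    if not text:
        return []
    return text.split(sep)


print(split_or_empty(None), split_or_empty(""), split_or_empty("a b"))
```

So the 1 returned by size(split(...)) on an empty string matches what Python does with an explicit separator.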
In the output of show(), you can't distinguish between the empty array [] and the array [""] containing one empty string...
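Why the two look identical can be sketched in plain Python. This assumes show() renders an array by joining the elements' string forms with ", " inside brackets, which is consistent with the output above but is an assumption on my part. `render_like_show` is a hypothetical stand-in, not a Spark function:

```python
def render_like_show(values):
    """Hypothetical stand-in for how show() is assumed to print an array."""
    if values is None:
        return "null"
    parts = ("null" if v is None else str(v) for v in values)
    return "[" + ", ".join(parts) + "]"


empty = []
one_empty_string = [""]

# Joining zero elements and joining one empty string both give "".
print(render_like_show(empty))
print(render_like_show(one_empty_string))

# repr keeps the quotes, so the difference stays visible.
print(repr(empty), repr(one_empty_string))
```

This is why collect(), as used in the session above, is the safer way to inspect such values: it returns Python objects whose repr keeps the empty string visible.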
Surprisingly, this behavior is not properly described in the documentation. It caught me off guard.