[PYTHON] [Spark Data Frame] Change a column from horizontal to vertical (Scala)

Introduction

In SparkDf, when changing from horizontal holding to vertical holding for a column with DF, it took some time, so I will record it as a memorandum. For the time being, the source code is Scala, but it shouldn't change that much with Python.

Data frame manipulation

Creating a test DataFrame

testDf.scala


val testDf = Seq(
    (1,"a",30,20),
    (2,"a",40,30),
    (3,"c",50,40),
    (4,"d",60,50),
    (2,"d",40,70),
    (2,"d",20,10),
    (1,"a",60,90)
   )
   .toDF("Column1", "Column2", "Column3","Column4")

/*
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|      1|      a|     30|     20|
|      2|      a|     40|     30|
|      3|      c|     50|     40|
|      4|      d|     60|     50|
|      2|      d|     40|     70|
|      2|      d|     20|     10|
|      1|      a|     60|     90|
+-------+-------+-------+-------+
*/


Vertical holding → Horizontal holding

Vertical holding → horizontal holding is easy, and you can easily do it with pivot. When using it, you need to use three methods like "df.groupby (). Pivot (). Count ()".

pivot.scala


val test2Df = testDf
.groupBy("Column1") 
.pivot("Column2") //Column you want to hold horizontally
.count() //Aggregate function
.na.fill(0) //Zero padding

/*
+-------+---+---+---+
|Column1|  a|  c|  d|
+-------+---+---+---+
|      1|  2|  0|  0|
|      3|  0|  1|  0|
|      4|  0|  0|  1|
|      2|  1|  0|  2|
+-------+---+---+---+
*/

Horizontal holding → Vertical holding

In the main subject, if you have test2Df and want to hold it vertically for some reason. Horizontal holding → Vertical holding does not have a function that can be done in one shot like pivot, so is it a little troublesome? Add. The ones used are "selectExpr ()" and "stack". The source code is as follows.

unpivot.scala


val test3Df = test2Df
    .selectExpr(
        "Column1",//Target column that serves as a reference for vertical holding
        "stack(3,'a',a,'c',c,'d',d)".//Original to be held vertically
    )
/*
+-------+----+----+
|Column1|col0|col1|
+-------+----+----+
|      1|   a|   2|
|      1|   c|   0|
|      1|   d|   0|
|      3|   a|   0|
|      3|   c|   1|
|      3|   d|   0|
|      4|   a|   0|
|      4|   c|   0|
|      4|   d|   1|
|      2|   a|   1|
|      2|   c|   0|
|      2|   d|   2|
+-------+----+----+
*/

It can be transformed like this. First, specify the target column that will be held vertically, and then use stack () to specify the source that will be held vertically. In the above case, a, c, and d are vertically held for Column1. 3 in stack () refers to the number of vertical holdings (in this case, 3 of a, c, d), "a" "c" "d" refers to the name, and acd refers to the original numerical value of the column name. (You can also simply put "name1" or a string). Also, the column name is col0 by default, so if you want to change it, you can change it by using **. WithColumnRenamed ("col0", "○○") **.

After that, I think that if you use join well, you can change from vertical holding to horizontal holding.

Reference link

How to Pivot and Unpivot a Spark DataFrame

Recommended Posts

[Spark Data Frame] Change a column from horizontal to vertical (Scala)
Change the data frame of pandas purchase data (id x product) to a dictionary
Generate a vertical image of a novel from text data
A memo that reads data from dashDB with Python & Spark
Create a data frame from the acquired boat race text data
[Linux] Copy data from Linux to Windows with a shell script
[Python] How to get & change rows / columns / values from a table.
From Elasticsearch installation to data entry
A memorandum to change to Manjaro Linux
Did not change from Python 2 to 3
Create multiple line charts from a data frame at once using Matplotlib
Share device events from a container on a Catalyst switch to Cisco Spark