When you want to use data from a Spark DataFrame in ordinary Python code, you can convert it to a pandas DataFrame with the toPandas() method, but this often raises a memory error, since the entire dataset is collected onto the driver. After some trial and error to get the data to fit in memory, I have summarized the approaches that seemed effective.
There may well be a better way, so if you know one, please let me know!
Conversion through Spark is constrained by spark.driver.memory and spark.driver.maxResultSize; conversion through dask is not, which makes the error easier to avoid. (If you can afford the driver memory, simply raising those Spark limits is also an option, as in the sketch below.)
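For reference, a minimal sketch of raising those limits when building the Spark session. The sizes here are arbitrary, and note that spark.driver.memory generally has to be set before the driver JVM starts (e.g. via spark-submit), so setting it in an already-running session may have no effect:

from pyspark.sql import SparkSession

# Arbitrary sizes for illustration; both are standard Spark config keys
spark = (
    SparkSession.builder
    .config('spark.driver.memory', '8g')         # driver heap; must be set at launch
    .config('spark.driver.maxResultSize', '4g')  # cap on data collected to the driver
    .getOrCreate()
)
pandas_df = df.toPandas()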
Conversion using dask
import dask.dataframe as dd

# Write the Spark DataFrame to parquet, bypassing toPandas() entirely
df.write.parquet(parquet_path)

# Read the parquet files lazily with dask, then materialize as a pandas DataFrame
dask_df = dd.read_parquet(parquet_path)
pandas_df = dask_df.compute()
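If you only need some of the columns, you can also prune them at read time to shrink the result further; a minimal sketch, where the column names are placeholders (dd.read_parquet accepts a columns argument):

# Hypothetical column names; only the listed columns are loaded
dask_df = dd.read_parquet(parquet_path, columns=['user_id', 'score'])
pandas_df = dask_df.compute()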
Changing column data types to smaller ones reduces the number of bytes per value, and therefore the total memory footprint.
Change data type
# For example, convert int32 columns (4 bytes) to int8 (1 byte);
# this is only safe if all values fit in int8's range (-128 to 127)
dask_df = dask_df.astype({k: 'int8' for k in dask_df.dtypes[dask_df.dtypes == 'int32'].index})
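As a sanity check, you can confirm how much this saved on the resulting pandas DataFrame; a minimal sketch using standard pandas memory_usage():

# Materialize and report the total in-memory size of the pandas DataFrame
pandas_df = dask_df.compute()
print(f"{pandas_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")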