When you want to use data from a Spark DataFrame in ordinary Python code, you can convert it to a pandas DataFrame with the toPandas() method, but this often raises a memory error, since the entire dataset is collected onto the driver. After some trial and error to get the data to fit in memory, I have summarized the approaches that seemed effective.
There may well be a better way, so if you know one, please let me know!
Conversion through Spark is constrained by spark.driver.memory and spark.driver.maxResultSize; conversion through dask is not, which makes the error easier to avoid. (If you can afford the driver memory, simply raising those Spark limits is also an option, as in the sketch below.)
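For reference, a minimal sketch of raising those limits when building the Spark session. The sizes here are arbitrary, and note that spark.driver.memory generally has to be set before the driver JVM starts (e.g. via spark-submit), so setting it in an already-running session may have no effect:

from pyspark.sql import SparkSession

# Arbitrary sizes for illustration; both are standard Spark config keys
spark = (
    SparkSession.builder
    .config('spark.driver.memory', '8g')         # driver heap; must be set at launch
    .config('spark.driver.maxResultSize', '4g')  # cap on data collected to the driver
    .getOrCreate()
)
pandas_df = df.toPandas()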
Conversion using dask
import dask.dataframe as dd

# Write the Spark DataFrame to parquet, bypassing toPandas() entirely
df.write.parquet(parquet_path)

# Read the parquet files lazily with dask, then materialize as a pandas DataFrame
dask_df = dd.read_parquet(parquet_path)
pandas_df = dask_df.compute()
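If you only need some of the columns, you can also prune them at read time to shrink the result further; a minimal sketch, where the column names are placeholders (dd.read_parquet accepts a columns argument):

# Hypothetical column names; only the listed columns are loaded
dask_df = dd.read_parquet(parquet_path, columns=['user_id', 'score'])
pandas_df = dask_df.compute()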
Changing column data types to smaller ones reduces the number of bytes per value, and therefore the total memory footprint.
Change data type
# For example, convert int32 columns (4 bytes) to int8 (1 byte);
# this is only safe if all values fit in int8's range (-128 to 127)
dask_df = dask_df.astype({k: 'int8' for k in dask_df.dtypes[dask_df.dtypes == 'int32'].index})
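As a sanity check, you can confirm how much this saved on the resulting pandas DataFrame; a minimal sketch using standard pandas memory_usage():

# Materialize and report the total in-memory size of the pandas DataFrame
pandas_df = dask_df.compute()
print(f"{pandas_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")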