[Python] What to do if you get a memory error when converting a PySpark DataFrame to a pandas DataFrame

Introduction

When you want to use data held in a Spark DataFrame from ordinary Python modules, you can convert it to a pandas DataFrame with the toPandas() method, but this often fails with a memory error. After some trial and error to get the data to fit in memory, I have summarized the approaches that seemed effective.

There are probably better ways, so if you know of one, please let me know!

Methods

Convert using dask

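toPandas() collects the whole result on the Spark driver, so the conversion is limited by spark.driver.memory and spark.driver.maxResultSize. If you instead write the data to Parquet and read it with dask, those two settings no longer apply, which makes the error easier to avoid. The resulting pandas DataFrame still has to fit in the memory of the Python process.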

Conversion using dask


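import dask.dataframe as dd

# df is the Spark DataFrame; parquet_path must be a location that dask can also read
df.write.parquet(parquet_path)
# Read the Parquet files lazily with dask, then materialize them as a pandas DataFrame
dask_df = dd.read_parquet(parquet_path)
pandas_df = dask_df.compute()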

Change data type

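Change column data types to smaller ones to reduce the number of bytes per value. Apply the change to the dask DataFrame before calling compute(), so that the pandas DataFrame is built with the smaller types.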

Change data type


# For example, convert every int32 column (4 bytes) to int8 (1 byte).
# This is only safe if all values fit in the int8 range (-128 to 127).
dask_df = dask_df.astype({k: 'int8' for k in dask_df.dtypes[dask_df.dtypes == 'int32'].index})
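
Casting every int32 column to int8 silently corrupts values outside the int8 range, so it can help to let the data decide the target type. The following is a minimal sketch of that idea using plain pandas rather than dask; the helper name downcast_int_columns and the sample data are hypothetical and only for illustration. It relies on pd.to_numeric with downcast="integer", which picks the smallest signed integer type that can hold the values of a column.

```python
import numpy as np
import pandas as pd


def downcast_int_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which each integer column uses the smallest fitting integer dtype.

    Hypothetical helper for illustration; not part of pandas, dask, or Spark.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # to_numeric inspects the actual values, so nothing overflows
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Made-up sample data: one column fits in int8, the other does not
sample = pd.DataFrame(
    {
        "small": np.array([1, 2, 3], dtype="int32"),
        "large": np.array([1, 2, 100_000], dtype="int32"),
    }
)
compact = downcast_int_columns(sample)

print(compact.dtypes)
# Compare the bytes used by the column data before and after
print(sample.memory_usage(index=False).sum(), "->", compact.memory_usage(index=False).sum())
```

Here the "small" column shrinks to int8 while "large" stays at int32 because 100000 does not fit in a smaller type. With dask, one possible approach would be to work out the target types on a sample first and then pass them to astype as in the snippet above.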
