[PYTHON] Ingenuity to handle data with Pandas in a memory-saving manner

If you want to process a large amount of data with the memory of your computer, but you want to survive with pandas, here are some memory-saving methods and related ideas that you can do as a final step. If you use it consciously on a regular basis, you can analyze without wasting computational resources, so I think there are various merits. However, if there is a limit and it does not work, it is recommended to increase the memory or utilize cloud services such as AWS and GCP in the first place.

Validation data

This time, I will use the famous Titanic Survivor Prediction Data because it is not realistic to write only the method. The amount of data is not large at all, but please forgive me for that. If you are only concerned about how to do it, you can skip this part.

Analytical environment

OS: windows 10 Home in a cheap laptop on the market Environment: Jupyter notebook launched by mounting Kaggle Docker on the above OS python:Python 3.6.6 :: Anaconda, Inc. The folder structure at the time of analysis is as follows, the folder where the notebook is placed and the folder where the usage data is placed. image.png

First, place the downloaded Titanic data in the input folder and prepare to read the data. Execute the following command in jupyter notebook.

#Library import
import os, gc, pickle, datetime, sys
import numpy as np
import pandas as pd

#Usage data path and data confirmation
INPUTPATH = '../input'

#Execution result
#> ['gender_submission.csv', 'test.csv', 'train.csv']

Then I was able to confirm the data. Load train.csv.

df = pd.read_csv(f'{INPUTPATH}/train.csv')

Change data type

Unless there is a big demand for maximum, minimum or decimal precision of data, you can reduce memory consumption by changing from double precision to single precision data type. So let's take verification data as an example. First, check the loaded data type of train.csv. Unless otherwise specified, integer values are read as int64 and decimals are read as float64.

#Output result below
#PassengerId      int64
#Survived         int64
#Pclass           int64
#Name            object
#Sex             object
#Age            float64
#SibSp            int64
#Parch            int64
#Ticket          object
#Fare           float64
#Cabin           object
#Embarked        object
#dtype: object

Let's change the data type immediately. .astype() The easiest way is to use .astype (). For example, let's change the Titanic ticket price Fare from float64 to float32. As a result, the data size is halved.

dfare1 = df.Fare.nbytes
print('The data size of Fare when float64 is:{:.2f} KB'.format(dfare1/1024))
dfare2 = df.Fare.astype('float32').nbytes
print('The data size of Fare at float32 is:{:.2f} KB'.format(dfare2/1024))
print('By changing the data type{:.2f}%I was able to reduce the file size'.format(100*(1-dfare2/dfare1)))

##Output result below
#The data size of Fare when float64 is: 6.96 KB
#The data size of Fare at float32 is: 3.48 KB
#50 by changing the data type.00%I was able to reduce the file size

However, when changing the data type, it is necessary to make sure that the accuracy of the maximum, minimum and decimal points of the original data is not affected analytically. For example, in the case of float type, the influence on analysis can be ignored even if the number of digits after the decimal point decreases, and in the case of int type, the range of integers should be firmly accommodated even if the data type is changed. is needed. If not, it is an extreme example, but if you change the data type even though the original maximum value of the column called PassengerId is 891 as shown below, the [maximum value that can be expressed by the data type](https: //) Please note that it will stick to docs.scipy.org/doc/numpy-1.10.0/user/basics.types.html) and the numbers themselves will change. Personally, unless you are dealing with very small decimals or large numbers, first change double precision (64bit) to single precision (32bit). If this does not help, look at the number range and decimal point precision requirements individually and make some corrections. Alternatively, you can make a judgment around here by directly using the function described later.

#Output result

#Output result

Specify the data type at once with read_csv

It is convenient to use read_csv of pandas because there is an option called dtype that allows you to specify the data type of each column when reading data. However, to do so, you need to create a dictionary that defines the column name and the corresponding data type in advance. Therefore, if you read the data once and check that changing the data type with df.describe () has no effect, the result of df.dtypes is made into a dictionary. Just replace 64 with 32, and replace dtype with np.dtype (when displaying, it is displayed as dtype, but when inputting, it must be np.dtype), and the dictionary can be created relatively easily. can make.

dtype_dict ={'PassengerId': np.dtype('int32'),
 'Survived': np.dtype('int32'),
 'Pclass': np.dtype('int32'),
 'Name': np.dtype('O'),
 'Sex': np.dtype('O'),
 'Age': np.dtype('float32'),
 'SibSp': np.dtype('int32'),
 'Parch': np.dtype('int32'),
 'Ticket': np.dtype('O'),
 'Fare': np.dtype('float32'),
 'Cabin': np.dtype('O'),
 'Embarked': np.dtype('O')}
df = pd.read_csv(f'{INPUTPATH}/train.csv',dtype=dtype_dict)
##Output result
#PassengerId      int32
#Survived         int32
#Pclass           int32
#Name            object
#Sex             object
#Age            float32
#SibSp            int32
#Parch            int32
#Ticket          object
#Fare           float32
#Cabin           object
#Embarked        object
#dtype: object

Use convenient functions

You can also use the convenience functions used in the Kaggle Competition about a year ago. I also participated in this competition, and I am grateful to have used it. It is also introduced in this article. Details can be found in this article, but they make decisions based on the minimum and maximum values of the data. The code below is partly customized.

――First of all, the double precision is changed compared to the numeric type whose data type is not the object type. --Also, if the data type includes datetime or category type, it is skipped to avoid the error. --Finally, the object type is changed to the category type.

#The following import is required when using the function
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype
#Function definition
def reduce_mem_usage(df, use_float16=False):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            # skip datetime type or categorical type
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                    df[col] = df[col].astype(np.float64)
            #df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

Let's use the function. In this case, we were able to reduce the memory usage of the loaded train.csv by about 44%.

df = reduce_mem_usage(df, use_float16=False)
#Output result
#Memory usage of dataframe is 0.08 MB
#Memory usage after optimization is: 0.05 MB
#Decreased by 43.7%

Change to Category type

Changing to the category type of the last point of the customized part of the reduce_mem_usage () function introduced above also saves memory. Even if you try changing the gender of the Titanic data from the object type to the category type, you can reduce it by about half in this case.

dsex1 = df.Sex.nbytes
print('The data size of Fare at the time of object is:{:.2f} KB'.format(dsex1/1024))
dsex2 = df.Fare.astype('category').nbytes
print('The data size of Fare at the time of category is:{:.2f} KB'.format(dsex2/1024))
print('By changing the data type{:.2f}%I was able to reduce the file size'.format(100*(1-dsex2/dsex1)))
##Output result
#The data size of Fare at the time of object is: 6.96 KB
#The data size of Fare in category is: 3.68 KB
#By changing the data type 47.17%I was able to reduce the file size

However, there are some precautions when changing to the category type, and if there are missing values, additional processing will be required. For example, a column called Cabin in a passenger cabin contains missing values.

##Output result
#PassengerId      0
#Survived         0
#Pclass           0
#Name             0
#Sex              0
#Age            177
#SibSp            0
#Parch            0
#Ticket           0
#Fare             0
#Cabin          687
#Embarked         2
#dtype: int64

Suppose you change this to a category type and then fill in the missing values. Then, I get an error that there is no category corresponding to "null" that I am trying to replace with "ValueError: fill value must be in categories".

df.Cabin = df.Cabin.astype('category')
df.Cabin = df.Cabin.fillna('null')

image.png You don't have to reduce the memory drastically, or if the object type is a decimal number of data, you may just use the object type or omit the conversion part of the reduce_mem_usage () function to the category type. Also, it is not always possible to reduce the memory by making it a category type. On the other hand, if you want to convert the object type to the category type and reduce the memory thoroughly, there are two ways to deal with the error for missing value completion. Generally speaking, if you complete the missing values in advance and then convert to the category type as in method 1, no error will occur, but if you want to complete missing values as needed during analysis, you can use step 2. It seems that it is effective to respond by one person.

  1. Save missing values and then convert to category type
  2. If it has been converted to category type in advance, add a new category corresponding to null and complete the missing value. Use the pandas add_categories (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.add_categories.html).
#1.the method of
df.Cabin = df.Cabin.fillna('null')
df.Cabin = df.Cabin.astype('category')

#2.the method of
df.Cabin = df.Cabin.cat.add_categories('null').fillna('null')

Use Sparse Data Structure

Use pandas' Sparse Data Structure (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) to reduce memory when encoding categorical variables into dummy variables This can be expected to have a memory reduction effect. If the number of columns of a dummy variable consisting of only 0 and 1 data increases, the memory reduction effect can be achieved by using Sparse Data Structure. In Sparse Data Structure, the position of some data with 1 is recorded, and the other 0 parts are compressed without being saved as data. However, there are some inconveniences caused by using Sparse Data Structure, so it will be described later.

Try dummy encoding normally

Before that, let's dare to select the categorical variables of Titanic data with a large number of levels (high cardinality) and see the effect.

for var in df.dtypes[df.dtypes =='object'].index.tolist():
    print('Unique level of '+var+' is {:}'.format(len(df[var].unique())))
#Output result
#Unique level of Name is 891
#Unique level of Sex is 2
#Unique level of Ticket is 681
#Unique level of Cabin is 148
Unique level of Embarked is 4

As expected, Name is a name, so let's exclude it from the target (laugh) Let's try to encode Ticket, Cabin, Embarked. Dummy encoding is easy with pandas' get_dummies () [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html). Even though the number of rows of data is 891, the number of columns is 834, which is terribly sparse data. The memory usage is about 726KB when checked with df_dummies.info (). (You can also get the data size in KB with sys.getsizeof (df_dummies) /1024 and df_dummies.memory_usage (). Sum () / 1024)

dummy_list = ['Ticket', 'Cabin', 'Embarked']
df_dummies = pd.get_dummies(df[dummy_list], dummy_na=True, sparse=False, prefix = dummy_list)

##Output result
#(891, 834)

##Output result
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 891 entries, 0 to 890
#Columns: 834 entries, Ticket_110152 to Embarked_nan
#dtypes: uint8(834)
#memory usage: 725.8 KB

Encode to Sparse Data Structure

Next, try using the Sparse Data Structure to reduce memory. get_dummies () has an option called Sparse, which is normally False, but change this to True. As a result, we were able to reduce memory by about 98% from 726KB to 13KB.

df_dummies2 = pd.get_dummies(df[dummy_list], dummy_na=True, sparse=True, prefix = dummy_list)
##Output result
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 891 entries, 0 to 890
#Columns: 834 entries, Ticket_110152 to Embarked_nan
#dtypes: Sparse[uint8, 0](834)
#memory usage: 13.2 KB

By using Sparse Data Structure, I was able to compress the memory significantly. However, there is something to be careful about. The point is that compression can make pandas methods that were used in normal data structures unusable. For example, suppose you have a dummy variable called Ticket_110152 and you want to sum the number of 1s in all the data. In the case of a normal DataFrame, .sum () is sufficient, but in the case of Sparse Data Structure, an error occurs because the data is compressed.

#How to make it normally
#Output result

#Sparse Data Structure (I created df with Sparse Data Structure_dummies2)
#Output result
#TypeError: sum() got an unexpected keyword argument 'min_count'

To avoid such errors, it is better to revert from the Sparse Data Structure to the original data structure. This time it is python3.6, so I will restore it with np.asarray, but since ptyhon3.7 it can be made simpler with a method called .to_dense ().

#Output result
#python 3.After 7
#df_dummies2.Ticket_110152.to_dense().sum()But you should be able to go


I tried to summarize the ideas that can be done when handling pandas dataframes with memory saving and the points to note at that time.

--You can reduce the memory by changing the data type from double precision (64bit) to single precision or less (~ 32bit), but you need to pay attention to the precision of the fraction and the range of the data value when selecting the data type. ――When you have a sparse data structure such as a dummy variable, you can use Sparse Data Structure to compress the data significantly. On the other hand, compressing the data makes it impossible to use the pandas methods that should normally be used, so it is necessary to devise a way to restore the dense data structure.

Thank you for visiting our website. If there is another better way, please leave a comment. It's easy, but I've uploaded the data and code to github. I also posted a sample of group by when using Sparse Data Structure.

Recommended Posts

Ingenuity to handle data with Pandas in a memory-saving manner
<Pandas> How to handle time series data in a pivot table
Try converting to tidy data with pandas
Working with 3D data structures in pandas
Try to aggregate doujin music data with pandas
Delete data in a pattern with Redis Cluster
Handle integer types with missing values in Pandas
Make holiday data into a data frame with pandas
Convert 202003 to 2020-03 with pandas
Data visualization with pandas
Data manipulation with Pandas!
Shuffle data with pandas
I want to give a group_id to a pandas data frame
Train MNIST data with a neural network in PyTorch
I want to transition with a button in flask
How to convert horizontally held data to vertically held data with pandas
How to access with cache when reading_json in pandas
How to extract non-missing value nan data with pandas
I want to work with a robot in python.
[Python] A memo to write CSV vertically with Pandas
Convert numeric variables to categorical with thresholds in pandas
How to extract non-missing value nan data with pandas
How to use fixture in Django to populate sample data associated with a user model
Save pandas data in Excel format to data assets with Cloud Pak for Data (Watson Studio)
I tried scraping food recall information with Python to create a pandas data frame
Handle Ambient data in Python
How to output a document in pdf format with Sphinx
Put the lists together in pandas to make a DataFrame
A handy function to add a column anywhere in a Pandas DataFrame
A collection of methods used when aggregating data with pandas
How to get an overview of your data in Pandas
How to handle data frames
Simultaneously input specific data to a specific sheet in many excels
How to create dataframes and mess with elements in pandas
Data science companion in python, how to specify elements in pandas
Data processing tips with Pandas
The minimum methods to remember when aggregating data in Pandas
If you want to become a data scientist, start with Kaggle
[Stock price analysis] Learn pandas with Nikkei 225 (004: Change read data to Nikkei 225)
Organize individual purchase data in a table with scikit-learn's MultiLabel Binarizer
Data engineers learn DevOps with a view to MLOps. ① Getting started
How to create a heatmap with an arbitrary domain in Python
[Pandas] I tried to analyze sales data with Python [For beginners]
How to create a large amount of test data in MySQL? ??
I will explain how to use Pandas in an easy-to-understand manner.
How to handle session in SQLAlchemy
How to deal with imbalanced data
Handle NetCDF format data in Python
How to deal with imbalanced data
Handle requests in a separate process
Versatile data plotting with pandas + matplotlib
Try to put data in MongoDB
How to Data Augmentation with PyTorch
Draw a graph with pandas + XlsxWriter
I want to do ○○ with Pandas
How to handle Japanese in Python
Handle various date formats with pandas
[Small story] How to save matplotlib graphs in a batch with Jupyter
SE, a beginner in data analysis, learns with the data science unit vol.1
How to get a specific column name and index name in pandas DataFrame
(Matplotlib) I want to draw a graph with a size specified in pixels