Introduction

If you carelessly throw a pandas DataFrame at a ridiculously large table, you will run out of memory and processing will take hours. So here is a memo of the points I personally watch for to keep DataFrames lighter and faster.

This article is a record of my own experience and notes, so I have not made detailed comparisons of how much faster each tip is.

Table of contents

- Type optimization
- Specify the type with `read_csv`
- Do not use `for`
  - Bonus
- Summary

Type optimization

Anyone who has entered a Kaggle tabular competition has probably seen this function. It shrinks each numeric column to the smallest dtype that fits its values, which helps prevent out-of-memory errors and speeds things up.

import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    # Memory before optimization, in MB
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of DataFrame is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = str(df[col].dtype)
        # Only plain int/float columns are downcast; object, bool,
        # datetime and category columns are left as they are.
        if col_type[:3] != 'int' and col_type[:5] != 'float':
            continue
        c_min = df[col].min()
        c_max = df[col].max()
        if col_type[:3] == 'int':
            # Pick the smallest integer type whose range covers the column
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            # otherwise the column stays int64
        else:
            # Note: float16 keeps only about 3 significant digits, so values get rounded
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)

    # Memory after optimization, in MB
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
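To get a feel for what this kind of downcasting buys you, here is a minimal sketch on made-up data. It does not use the function above. Instead it swaps in pandas' built-in `pd.to_numeric(..., downcast=...)`, which does a similar job for one column at a time (for floats it never goes below float32). The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: both columns start out as 64-bit types.
df = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64),   # integers 0..999
    "ratio": np.arange(1000) * 0.25,            # float64 values
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# Ask pandas for the smallest signed integer / float type that still fits.
small["count"] = pd.to_numeric(small["count"], downcast="integer")
small["ratio"] = pd.to_numeric(small["ratio"], downcast="float")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Comparing `memory_usage` before and after like this is a quick way to check whether a downcast is worth it for your own data.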

Specify the type with `read_csv`

By specifying the types at read time, pandas can skip working out the types itself, so the data loads faster. This is not very practical for data with a very large number of columns (writing them all out is a pain).

df = pd.read_csv('./train.csv', dtype={'a': int, 'b': float, 'c': str})
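As a rough illustration, the sketch below reads the same small CSV twice: once with inferred types and once with compact types given through `dtype`. The CSV text is a made-up stand-in for `train.csv`, and the specific choices (`np.int8`, `np.float32`, `"category"`) are assumptions that only make sense if your values actually fit those types.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV contents standing in for train.csv.
csv_text = "a,b,c\n1,0.5,x\n2,1.5,y\n3,2.5,z\n"

# Default: pandas infers the types (64-bit for the numeric columns).
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit: compact types chosen up front, no conversion step afterwards.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"a": np.int8, "b": np.float32, "c": "category"},
)

print(inferred.dtypes)
print(typed.dtypes)
```

A dictionary like this can also be built once from a small sample of the file and reused for the full read.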

Do not use `for`

Python is slow in `for` loops because the interpreter has to work through every iteration one step at a time. So I avoid `for` as much as possible and use `map` and `apply` instead.

By the way, I choose between these functions as follows, but I do not actually know how they differ under the hood (someone please tell me).

map

When applying a dictionary to a column

d = {'hoge':'Hoge', 'fuga':'Fuga', 'piyo':'Piyo'}

df['tmp'] = df['tmp'].map(d)
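One thing to watch for with a dictionary: values that are not keys of the dictionary come back as NaN. The sketch below shows this on made-up data and one possible way to keep the original value instead (the `"other"` row is an assumption added for illustration).

```python
import pandas as pd

d = {"hoge": "Hoge", "fuga": "Fuga", "piyo": "Piyo"}
df = pd.DataFrame({"tmp": ["hoge", "fuga", "piyo", "other"]})  # "other" is not a key of d

# Values without a matching key become NaN.
mapped = df["tmp"].map(d)

# One option: fall back to the original value where nothing matched.
kept = mapped.fillna(df["tmp"])

print(mapped)
print(kept)
```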

apply

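When applying a function to each group produced by `groupby`

# group_keys=False keeps the original row index so the result can be assigned back
df['rolling_mean'] = df.groupby('id', group_keys=False)['timeseries'].apply(lambda x: x.rolling(12, min_periods=1).mean())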
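Here is a small runnable sketch of a per-group rolling mean. Note that it uses `transform` rather than `apply`: `transform` is a different groupby method that always returns a result aligned with the original rows, which makes assigning back to a column straightforward. The data, the column names `id` and `timeseries`, and the window of 2 (instead of 12) are assumptions chosen to keep the example tiny.

```python
import pandas as pd

# Hypothetical data: two ids, three time steps each.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b"],
    "timeseries": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Rolling mean over a window of 2 rows, computed separately for each id.
# min_periods=1 means the first row of each group uses just itself.
df["rolling_mean"] = (
    df.groupby("id")["timeseries"]
      .transform(lambda x: x.rolling(2, min_periods=1).mean())
)

print(df)
```

The window never crosses from one id into the next, which is the whole point of grouping first.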

When you want to process the data row by row

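import numpy as np

def cos_sim(v1, v2):
    # Cosine similarity between two vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# v1 is a reference vector with one element per column of df
sim = df.apply(lambda x: cos_sim(v1, x), axis=1)  # returns a pd.Series, one value per row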
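For reference, a runnable version of the row-wise case on made-up data, next to the same calculation done on the whole NumPy array at once. The array version is not from the original memo; it is an assumption about what a loop-free alternative could look like. The frame contents and `v1` are hypothetical.

```python
import numpy as np
import pandas as pd

def cos_sim(v1, v2):
    # Cosine similarity between two vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Hypothetical data: each row is a 2-dimensional vector.
df = pd.DataFrame({"x": [1.0, 0.0, 1.0], "y": [0.0, 1.0, 1.0]})
v1 = np.array([1.0, 0.0])  # reference vector

# Row by row with apply (calls cos_sim once per row).
row_wise = df.apply(lambda row: cos_sim(v1, row), axis=1)

# Same calculation on the whole array at once, no per-row Python call.
values = df.to_numpy()
vectorized = values @ v1 / (np.linalg.norm(values, axis=1) * np.linalg.norm(v1))

print(row_wise.to_numpy())
print(vectorized)
```

Both give the same numbers, so which one to use mostly comes down to how many rows you have and how easy the calculation is to write in array form.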

Both of them can also process a single column, so for that case you can pick whichever you feel like.

By the way, my personal gut feeling about `for` in Python: 100 iterations → It's okay. 1,000 iterations → Hmm? 10,000 iterations → Hey ... 100,000 iterations → Are you sane?

Bonus

Sometimes you run into this: "I want to process each category separately, but I can't use `map` or `apply` ... it takes forever to run and I can't wait ...".

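year_list = [2010, 2011, 2012, 2013]

for y in year_list:
    # Select the rows for one year
    df_y = df[df['year'] == y]

    # ... processing that turns df_y into df_y_executed goes here ...
    df_y_executed = df_y  # placeholder for the processed result

    df_y_executed.to_csv('./savefolder/save_{y}.csv'.format(y=y))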

pandas currently does not spread its work across physical cores on its own, so most of the machine sits idle while you waste time waiting. The fix is to force the work onto multiple physical cores.

In such cases, my lab uses the following method.

1. Split the categories and save each part as a CSV (in the example above, split into [2010, 2011] and [2012, 2013] and save each part to its own CSV).
2. Start another Jupyter and run the processing in the two Jupyters at the same time.

This makes it possible to process the data using two physical cores.
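The splitting step described above could be sketched roughly as follows. `split_into_chunks` is a hypothetical helper written for this example, the data is made up, and the files go to a temporary directory rather than a real save folder. Each Jupyter would then load and process its own `part_*.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks consecutive pieces of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

year_list = [2010, 2011, 2012, 2013]
# Hypothetical data with a 'year' column.
df = pd.DataFrame({"year": [2010, 2011, 2012, 2013, 2013], "value": [1, 2, 3, 4, 5]})

# One chunk of years per Jupyter instance (two here).
chunks = split_into_chunks(year_list, 2)

out_dir = Path(tempfile.mkdtemp())
paths = []
for i, years in enumerate(chunks):
    path = out_dir / f"part_{i}.csv"
    # Keep only the rows whose year belongs to this chunk.
    df[df["year"].isin(years)].to_csv(path, index=False)
    paths.append(path)

print(chunks)
```

If the categories are very uneven in size, splitting by row count rather than by number of categories may balance the two Jupyters better.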

Summary

There is a library out there called Dask, which apparently can process larger-than-memory data in parallel. I have heard that it is highly compatible with pandas and pleasant to use, so I would like to try it.

[PYTHON] Is my pandas too slow?