[PYTHON] Sequential processing method when there is not enough memory in Keras

Overview

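Use model.fit_generator (https://keras.io/ja/models/model/#fit_generator) to load and preprocess the data one batch at a time. In TensorFlow 2.1 and later, fit_generator is deprecated and model.fit accepts a generator directly.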

Environment

How to use

When memory is short, the easiest fix is to reduce batch_size. That does not help, however, if you run out of memory during preprocessing such as padding, or if the data is too large to load all at once. In those cases, data loading → preprocessing must be performed for each batch while the model is training.

With fit_generator, you can pass a generator where you would normally pass the data. You can also pass a generator for the validation data with validation_data=. The per-batch loading and preprocessing described above is done inside this generator (generate_arrays), which is defined later.

batch_size = 8 #Number of samples loaded into memory at once
model.fit_generator(
    generate_arrays(x_train, y_train, batch_size),
    validation_data=generate_arrays(x_test, y_test, batch_size),
    epochs=3,
    steps_per_epoch=len(x_train) // batch_size,
    validation_steps=len(x_test) // batch_size
)

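steps_per_epoch determines how many steps (batches) are executed in one epoch. Because the generator loops forever, Keras has no other way to know where an epoch ends. Normally you want to go through all the data once per epoch, so divide the total number of samples by the batch size. Note that floor division (//) does not count a final partial batch; if the data length is not a multiple of batch_size, math.ceil(len(x_train) / batch_size) counts it as well. validation_steps is the same setting for the validation data.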
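The step count arithmetic can be checked with plain Python. The following is a minimal sketch, not part of the original code: count_steps is a hypothetical helper, the dataset size of 20 is a made-up value, and math.ceil is shown only as an alternative to the floor division used in the call above.

```python
import math

def count_steps(n_samples, batch_size, include_partial=False):
    """Return the number of batches needed to go through n_samples once."""
    if include_partial:
        # Count the final, smaller batch as a step too
        return math.ceil(n_samples / batch_size)
    # Floor division: a trailing partial batch is not counted
    return n_samples // batch_size

n_samples = 20   # hypothetical dataset size
batch_size = 8

floor_steps = count_steps(n_samples, batch_size)
ceil_steps = count_steps(n_samples, batch_size, include_partial=True)
print(floor_steps, ceil_steps)
```

With 20 samples and a batch size of 8, the two counts differ by one step, which is the partial batch of 4 samples.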

generate_arrays is defined as follows.

def generate_arrays(x, y, batch_size=32):
    i = 0
    while True:
        batch_df = x[i * batch_size : (i + 1) * batch_size]
        batch_y = y[i * batch_size : (i + 1) * batch_size]
        if (i + 1) * batch_size >= len(x):
            i = 0 #i must be reset here; the generator keeps running across epochs
        else:
            i += 1
        yield process_data(batch_df, batch_y)

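#An example of preprocessing
from keras.preprocessing.sequence import pad_sequences

#MAX_RES_TOKENS (the padded sequence length) is assumed to be defined elsewhere
def process_data(batch_df, batch_y):
    arr = {}
    #Pad each column so that all histories have the same length
    for c in batch_df.columns:
        arr[c] = pad_sequences(batch_df[c], dtype='float32', maxlen=MAX_RES_TOKENS)

    return (arr, batch_y)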

This time, the data itself fit in memory, but it could not all be preprocessed at once. generate_arrays receives x and y and first slices the DataFrame down to the batch size. Python's yield is similar to return, but instead of ending the function it suspends it, and execution resumes from the same point when the next value is requested. The following site is helpful: http://ailaby.com/yield/ Because i in generate_arrays is not reset when the next epoch starts, you have to reset it to 0 yourself before the index runs past the end of the data.
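The behavior of yield and the manual reset can be tried out without Keras. The following is a minimal sketch that uses a plain list in place of the DataFrame; batch_generator and the toy data are hypothetical names chosen for illustration, and itertools.islice stands in for the steps that Keras would request per epoch.

```python
from itertools import islice

def batch_generator(data, batch_size):
    """Yield consecutive slices of data forever, wrapping around at the end."""
    i = 0
    while True:
        batch = data[i * batch_size : (i + 1) * batch_size]
        if (i + 1) * batch_size >= len(data):
            i = 0  # the next request starts again from the beginning
        else:
            i += 1
        # Execution is suspended here and resumes on the next request
        yield batch

data = list(range(6))  # toy data: 6 samples
gen = batch_generator(data, batch_size=2)

# 3 steps per "epoch"; the same generator object serves both epochs
epoch1 = list(islice(gen, 3))
epoch2 = list(islice(gen, 3))
print(epoch1)
print(epoch2)
```

Because the index wraps around inside the generator, the second epoch sees the same three batches as the first, even though the generator was never recreated.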

The data to yield is a tuple of (batch_x, batch_y). If you have named the inputs and outputs of your model, you can yield dictionaries keyed by those names, in the following form:

(
    {
        'in1': batch_in1, 'in2': batch_in2
    },
    {
        'out1': batch_out1, 'out2': batch_out2
    }
)

When the data does not fit in memory at once

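If you have a large number of serially numbered CSV files, for example, set aside the validation data in advance and read one file per batch:

import pandas as pd

#Assumes the files are named 0.csv to (max_file_count - 1).csv
def generate_arrays(file_path, max_file_count):
    i = 0
    while True:
        batch_df = pd.read_csv(f"{file_path}/{i}.csv")
        #Remove the label column so that it is not preprocessed as an input
        batch_y = batch_df.pop('answer')
        if i + 1 >= max_file_count:
            i = 0 #Wrap around after the last file
        else:
            i += 1
        yield process_data(batch_df, batch_y)

If the number of rows differs from file to file, read the CSV and then slice it further into batches.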
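Slicing each file further might look like the sketch below. It is only an illustration under assumptions that are not in the original: the files are named 0.csv up to (file_count - 1).csv, every file has an 'answer' column, and generate_batches_from_files is a hypothetical name. Preprocessing with process_data is left out to keep the example self-contained.

```python
import pandas as pd

def generate_batches_from_files(file_path, file_count, batch_size):
    """Loop over numbered CSV files forever, yielding (features, labels) batches."""
    i = 0
    while True:
        df = pd.read_csv(f"{file_path}/{i}.csv")
        # Cut the file into pieces of batch_size rows; the last piece may be smaller
        for start in range(0, len(df), batch_size):
            batch_df = df.iloc[start : start + batch_size]
            batch_y = batch_df["answer"]
            # Yield the features without the label column
            yield batch_df.drop(columns=["answer"]), batch_y
        # Move on to the next file, wrapping around after the last one
        i = (i + 1) % file_count
```

With this approach the number of batches per file varies, so steps_per_epoch has to be the total number of batches over all files, which means counting the rows of each file in advance.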

Reference site

The following article was very helpful, thank you. This article simply condenses its content: Keras and the magic of the generator
