This is an updated version of a previous article: How to train on a large amount of data with TFRecord & DataSet in TensorFlow & Keras - Qiita
The goal is to train efficiently on data too large to fit in memory, using a method that lets CPU-side data reading and GPU-side computation run in parallel. The data is saved in a specific format and then read efficiently with the DataSet API during training.
With the release of TensorFlow 2, some module names have changed since the previous article, and some of the processing has become easier to write. This article shows how to write the same thing in TensorFlow 2, focusing on the differences from the previous version. I have also switched to the Keras bundled with TensorFlow.
This article uses Python 3.6.9 + TensorFlow 2.1.0 on Linux (Ubuntu 18.04).
Starting with TensorFlow 1.15 / 2.1, the CPU and GPU builds of the pip package have been unified. So whether you just want to try things out on a CPU or train in earnest on a GPU,
pip3 install tensorflow==2.1.0
is all you need. Note that if you want to use a GPU, you also have to set up CUDA 10.1. GPU support | TensorFlow
TensorFlow has its own data format, TFRecord, that it can process efficiently. Let's first convert existing data into TFRecords so that the DataSet API can read them later.
data2tfrecord.py
#!/usr/bin/env python3
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
def feature_float_list(l):
    return tf.train.Feature(float_list=tf.train.FloatList(value=l))

def record2example(r):
    return tf.train.Example(features=tf.train.Features(feature={
        "x": feature_float_list(r[0:-1]),
        "y": feature_float_list([r[-1]])
    }))

filename_train = "train.tfrecords"
filename_test = "test.tfrecords"

# === Read MNIST data ===
# For simplicity, the test data is also used as the validation data during training.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print("x_train : ", x_train.shape)  # x_train : (60000, 28, 28)
print("y_train : ", y_train.shape)  # y_train : (60000,)
print("x_test  : ", x_test.shape)   # x_test  : (10000, 28, 28)
print("y_test  : ", y_test.shape)   # y_test  : (10000,)

# Preprocessing
# Convert pixels to float32 values in [0, 1]
# Also flatten the features to 1D for TFRecord writing (one row = one record)
x_train = x_train.reshape((-1, 28*28)).astype("float32") / 255.0
x_test = x_test.reshape((-1, 28*28)).astype("float32") / 255.0
# Labels are also float32
y_train = y_train.reshape((-1, 1)).astype("float32")
y_test = y_test.reshape((-1, 1)).astype("float32")
# Concatenate features and labels for TFRecord writing
data_train = np.c_[x_train, y_train]
data_test = np.c_[x_test, y_test]

# In practice, convert whatever data you want to train on into the same format.
# If all the data does not fit in memory, you can build it little by little
# and repeat the write phase below for each chunk.

# Write the training data to a TFRecord file
with tf.io.TFRecordWriter(filename_train) as writer:
    for r in data_train:
        ex = record2example(r)
        writer.write(ex.SerializeToString())

# Write the evaluation data to a TFRecord file
with tf.io.TFRecordWriter(filename_test) as writer:
    for r in data_test:
        ex = record2example(r)
        writer.write(ex.SerializeToString())
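As the comments above note, the writer does not need all records up front, so when the data does not fit in memory you can write it chunk by chunk. A minimal sketch, assuming a hypothetical generator load_chunks() that yields 2-D NumPy arrays whose rows have the same [features..., label] layout as data_train:

# Hypothetical sketch: load_chunks() is assumed to yield 2-D float32 arrays
# whose rows are [feature_0, ..., feature_{n-1}, label], as above.
with tf.io.TFRecordWriter(filename_train) as writer:
    for chunk in load_chunks():  # e.g. one source file or DB page at a time
        for r in chunk:
            writer.write(record2example(r).SerializeToString())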
It's almost the same as last time, but with the TensorFlow version upgrade the tensorflow.python_io package has disappeared, and the TFRecord-related functions now live in tensorflow.io.
Also, since I switched to the Keras bundled with TensorFlow, the import statements have changed, but the way the MNIST dataset itself is read has not.
If the GPU libraries are not installed, you will see CUDA-related warnings (libcublas cannot be found, etc.), but if you just want to try things out on the CPU you can ignore them.
The training code has changed a little since last time. Here is the code first, followed by the differences.
train.py
#!/usr/bin/env python3
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Model
# Training settings
batch_size = 32
epochs = 10

# Feature settings
num_classes = 10     # Number of label classes: the digits 0-9
feature_dim = 28*28  # Feature dimension. Kept 1D for simplicity.

# Number of training / evaluation records. Check these in advance.
# Note that when using multiple TFRecord files, these are the totals over all files.
num_records_train = 60000
num_records_test = 10000

# Number of mini-batches per epoch. Used during training.
steps_per_epoch_train = (num_records_train-1) // batch_size + 1
steps_per_epoch_test = (num_records_test-1) // batch_size + 1

# Decode a single TFRecord
def parse_example(example):
    features = tf.io.parse_single_example(
        example,
        features={
            # Specify the number of dimensions when reading a list
            "x": tf.io.FixedLenFeature([feature_dim], dtype=tf.float32),
            "y": tf.io.FixedLenFeature([], dtype=tf.float32)
        })
    x = features["x"]
    y = features["y"]
    return x, y
# === Prepare the TFRecord data for training and evaluation ===
dataset_train = tf.data.TFRecordDataset(["train.tfrecords"]) \
    .map(parse_example) \
    .shuffle(batch_size * 100) \
    .batch(batch_size).repeat(-1)
# To use multiple TFRecord files above, pass a list of file names instead:
# dataset_train = tf.data.TFRecordDataset(["train.tfrecords.{}".format(i) for i in range(10)]) \
dataset_test = tf.data.TFRecordDataset(["test.tfrecords"]) \
    .map(parse_example) \
    .batch(batch_size)
# === Model definition ===
# This time we use just one 512-unit hidden layer.
layer_input = Input(shape=(feature_dim,))
fc1 = Dense(512, activation="relu")(layer_input)
layer_output = Dense(num_classes, activation="softmax")(fc1)
model = Model(layer_input, layer_output)
model.summary()

# With integer class labels, loss="sparse_categorical_crossentropy" works as-is.
# With one-hot encoded labels, use loss="categorical_crossentropy" instead.
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=RMSprop(),
    metrics=["accuracy"])

# === Training ===
# Save checkpoints along the way
cp_cb = ModelCheckpoint(
    filepath="weights.{epoch:02d}-{loss:.4f}-{val_loss:.4f}.hdf5",
    monitor="val_loss",
    verbose=1,
    save_best_only=True,
    mode="auto")
model.fit(
    x=dataset_train,
    epochs=epochs,
    verbose=1,
    steps_per_epoch=steps_per_epoch_train,
    validation_data=dataset_test,
    validation_steps=steps_per_epoch_test,
    callbacks=[cp_cb])
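As the comment above model.compile() notes, sparse_categorical_crossentropy works directly with integer class labels. If you would rather use one-hot labels, a minimal sketch (the parse_example_onehot name is just for illustration) could convert them in the pipeline and switch the loss:

# Hypothetical variant: convert the scalar label to a one-hot vector in the pipeline
def parse_example_onehot(example):
    x, y = parse_example(example)
    return x, tf.one_hot(tf.cast(y, tf.int32), num_classes)

# ...then build the datasets with .map(parse_example_onehot) and compile with
# model.compile(loss="categorical_crossentropy", optimizer=RMSprop(), metrics=["accuracy"])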
tensorflow.keras.Model.fit() has changed so that it can take a DataSet as its training data.
tf.keras.Model | TensorFlow Core v2.1.0
x: Input data. It could be: (Omitted) A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
Previously, to train from a DataSet you had to feed the data into the Input layer, which meant the tedious procedure of creating two weight-sharing models, one for training and one for evaluation. In TensorFlow 2.x (and the Keras bundled with it), you can pass a DataSet directly to Model.fit(), so a single model is enough. You also no longer need to create your own iterator with make_one_shot_iterator(). Great news!
In addition, the validation_data argument of tensorflow.keras.Model.fit() can now also take a DataSet for evaluation, so you no longer have to write your own evaluation callback (although the progress bar during evaluation does not appear; to get that you would have to write the training loop yourself).
By loading multiple files in parallel, you may be able to raise GPU utilization (i.e., speed up training).
Split the training data and write it out the same way as last time. The only difference from last time is that tf.python_io has become tf.io.
data2tfrecord.py (part)
for i in range(10):
    with tf.io.TFRecordWriter(filename_train + "." + str(i)) as writer:
        for r in data_train[i::10]:
            ex = record2example(r)
            writer.write(ex.SerializeToString())
For training, the way dataset_train is created changes as follows.
train.py (part)
dataset_train = tf.data.Dataset.from_tensor_slices(["train.tfrecords.{}".format(i) for i in range(10)]) \
    .interleave(
        lambda filename: tf.data.TFRecordDataset(filename).map(parse_example, num_parallel_calls=1),
        cycle_length=10) \
    .shuffle(batch_size * 100) \
    .batch(batch_size) \
    .prefetch(1) \
    .repeat(-1)
The function that was tf.contrib.data.parallel_interleave() (later tf.data.experimental.parallel_interleave()) in the previous article has been officially incorporated as a DataSet method, so it is a little easier to write. However, it behaves as if sloppy=False, so to get the sloppy=True behaviour it seems you need to set an option via with_options().
tf.data.experimental.parallel_interleave | TensorFlow Core v2.1.0
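A minimal sketch of what that might look like, assuming the tf.data.Options attribute experimental_deterministic (as in TF 2.1) is the knob that corresponds to sloppy:

# Sketch: let interleave() emit elements in a non-deterministic order,
# roughly equivalent to sloppy=True (assumes the experimental_deterministic option).
options = tf.data.Options()
options.experimental_deterministic = False
dataset_train = dataset_train.with_options(options)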
There are a few changes, but overall it has become easier to write, so there is not much to be afraid of. Core performance may have improved as well (has it really?), so let's feed lots of data into TensorFlow 2!