[PYTHON] [Google Colab] How to interrupt learning and then resume it

1. Introduction

Here are some useful tips for using Google Colab.

2. Challenges

Google Colab Pro limits execution time to a maximum of 24 hours. If training a model takes longer than that, the session is cut off partway through and the results computed so far are lost.

For example, suppose you estimate that 200 epochs will take about 24 hours. When you actually run the job, it takes slightly longer than estimated, and Google Colab may disconnect at around epoch 190.
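
The arithmetic behind this example can be checked with a short sketch. The 5% slowdown is a hypothetical figure chosen only for illustration; it is not a measurement from the original run.

```python
# Sketch: how many epochs fit into the 24-hour limit when each epoch
# runs slightly slower than estimated.
LIMIT_SECONDS = 24 * 60 * 60   # the 24-hour session limit
PLANNED_EPOCHS = 200

# Estimated time per epoch if 200 epochs were to fit exactly into 24 hours.
estimated_epoch_seconds = LIMIT_SECONDS / PLANNED_EPOCHS  # 432 seconds

OVERRUN = 0.05  # assumed 5% slowdown (illustrative value)
actual_epoch_seconds = estimated_epoch_seconds * (1 + OVERRUN)

# Number of whole epochs that finish before the session is cut off.
completed_epochs = int(LIMIT_SECONDS // actual_epoch_seconds)
print(completed_epochs)  # 190
```

Under this assumption, even a small overrun per epoch is enough to lose the last ten epochs, which is why intermediate results need to be saved outside the session.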

3. Solution

To solve this, we adopt the following method.

  1. Use Keras' ModelCheckpoint() to save the model at regular intervals.
  2. Save the model to the temp folder on Google Drive (Reference: Google Colab Tips Organized)
  3. Load the saved model and resume training from the point where the previous run stopped.

3.1. ModelCheckpoint() settings


from tensorflow.keras.callbacks import ModelCheckpoint

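checkpoint = ModelCheckpoint(filepath='XXX.h5',
                             monitor='loss', save_best_only=True, mode='auto',
                             save_weights_only=False)  #Example values; see the argument description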

Argument description

  1. filepath: String. The path where the model file is saved.
  2. monitor: The value to monitor.
  3. save_best_only: If True, the best model so far (according to the monitored value) is not overwritten by a worse one.
  4. mode: One of {auto, min, max}. Determines whether a lower or a higher monitored value counts as better.
  5. save_weights_only: If True, only the weights of the model are saved. Otherwise, the entire model is saved.
  6. period: Interval between checkpoints (number of epochs). Deprecated in recent TensorFlow releases in favor of save_freq.
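
How monitor, mode, and save_best_only interact can be illustrated with a simplified sketch. This is not Keras' actual implementation: `is_improvement` and `epochs_saved` are hypothetical helper names, and the `auto` rule used here (names containing "acc" are maximized, everything else is minimized) is a simplifying assumption.

```python
def is_improvement(current, best, monitor="val_loss", mode="auto"):
    """Return True if `current` is better than `best` for the monitored value."""
    if mode == "auto":
        # Assumption: accuracy-like values are maximized, all others minimized.
        mode = "max" if "acc" in monitor else "min"
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'auto', 'min' or 'max'")
    return current > best if mode == "max" else current < best


def epochs_saved(values, monitor="val_loss", mode="auto"):
    """List the 1-based epochs at which a best-only checkpoint would be written."""
    saved = []
    best = None
    for epoch, value in enumerate(values, start=1):
        if best is None or is_improvement(value, best, monitor, mode):
            best = value
            saved.append(epoch)
    return saved


losses = [0.90, 0.60, 0.65, 0.50, 0.55]
print(epochs_saved(losses, monitor="loss"))  # [1, 2, 4]
```

In this toy run, epochs 3 and 5 do not improve on the best loss seen so far, so no file would be written for them.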

3.2. First training run → Save intermediate results to Google Drive

The code below is based on the Keras MNIST example.


from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist

Mount Google Drive and set the model save folder


import os
from google.colab import drive

drive.mount('/content/drive')  #Mount Google Drive

MODEL_DIR = "/content/drive/My Drive/temp"

if not os.path.exists(MODEL_DIR):  #If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)

Run training


history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,  validation_split=0.1, callbacks=[checkpoint])


Running the above code will save the model file in the temp folder.
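
The `{epoch:02d}` placeholder in `filepath` is filled in with the epoch number using Python string formatting, so each saved epoch gets its own file. A minimal sketch of the resulting names follows; the epoch numbers are arbitrary examples and nothing is written to disk.

```python
TEMPLATE = "model-{epoch:02d}.h5"  # same pattern as in the checkpoint above

# Epochs are numbered from 1; 02d pads the number to two digits.
names = [TEMPLATE.format(epoch=e) for e in (1, 5, 12)]
print(names)  # ['model-01.h5', 'model-05.h5', 'model-12.h5']
```

Because save_best_only=True is set, a file is written only for epochs where the monitored value improved, so the numbers in the folder may have gaps.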

3.3. Second training run → Load the first model and resume training partway

Training resumes from model-05.h5.

Model loading


#Model loading
model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))  #Specify the checkpoint to resume from
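
Rather than hard-coding model-05.h5, the newest checkpoint could be located by parsing the epoch number out of each file name. The sketch below uses only the standard library. `find_latest_checkpoint` is a hypothetical helper, and it assumes the files follow the model-XX.h5 naming used in the first run.

```python
import os
import re
import tempfile

# Matches names such as model-05.h5 and captures the epoch number.
CHECKPOINT_PATTERN = re.compile(r"^model-(\d+)\.h5$")


def find_latest_checkpoint(model_dir):
    """Return (path, epoch) of the highest-numbered model-XX.h5, or (None, 0)."""
    latest_path, latest_epoch = None, 0
    for name in os.listdir(model_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > latest_epoch:
            latest_epoch = int(match.group(1))
            latest_path = os.path.join(model_dir, name)
    return latest_path, latest_epoch


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for file_name in ("model-01.h5", "model-03.h5", "model-05.h5", "notes.txt"):
        open(os.path.join(demo_dir, file_name), "w").close()
    path, epoch = find_latest_checkpoint(demo_dir)
    print(os.path.basename(path), epoch)  # model-05.h5 5
```

The returned path can be passed to model.load_weights, and the epoch number can be passed as initial_epoch to model.fit so that epoch numbering continues from the previous run. In that case, epochs is the index of the final epoch, not the number of additional epochs.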

Rename the model files for the second run

Change model-XX.h5 to model_new-XX.h5.


if not os.path.exists(MODEL_DIR):  #If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model_new-{epoch:02d}.h5"),
    monitor='loss',
    save_best_only=True)  #Same setting as the first run

Resume training


history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS,  validation_split=0.1, callbacks=[checkpoint])

Looking at the Training acc values, we can see that training resumed from where the previous run left off.


The newly trained model is also saved.

4. Summary

  1. Google Colab Pro disconnects the session after 24 hours.
  2. This problem can be worked around with Keras' ModelCheckpoint() and a mounted Google Drive.
  3. We confirmed that the proposed method works and is effective.

5. Overall code

First training run


from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import os
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')  #Mount Google Drive
MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  #If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model-{epoch:02d}.h5"), save_best_only=True)


(Xtrain, ytrain), (Xtest, ytest) = mnist.load_data()
Xtrain = Xtrain.reshape(60000, 784).astype("float32") / 255
Xtest = Xtest.reshape(10000, 784).astype("float32") / 255
Ytrain = to_categorical(ytrain, 10)
Ytest = to_categorical(ytest, 10)
print(Xtrain.shape, Xtest.shape, Ytrain.shape, Ytest.shape)
#Model definition
model = Sequential()
model.add(Dense(512, input_shape=(784,), activation="relu"))
model.add(Dense(512, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])  #accuracy is needed for the plots below

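#Run training
BATCH_SIZE = 128  #Assumed value; not given in the original
NUM_EPOCHS = 20   #Assumed value; not given in the original
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_split=0.1, callbacks=[checkpoint])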

#Graph drawing
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plot_epochs = range(1, len(acc)+1)
# Accuracy
plt.plot(plot_epochs, acc, 'bo-', label='Training acc')
plt.plot(plot_epochs, val_acc, 'b', label='Validation acc')
plt.title('model accuracy')
plt.ylabel('accuracy')  #Y-axis label
plt.xlabel('epoch')  #X-axis label
plt.legend()
plt.show()  #Finish this figure so the loss plot is drawn separately

loss = history.history['loss']
val_loss = history.history['val_loss']

plot_epochs = range(1, len(loss)+1)
# Loss
plt.plot(plot_epochs, loss, 'ro-', label='Training loss')
plt.plot(plot_epochs, val_loss, 'r', label='Validation loss')
plt.title('model loss')
plt.ylabel('loss')  #Y-axis label
plt.xlabel('epoch')  #X-axis label
plt.legend()
plt.show()

Second and subsequent training runs


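from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical
import os
import matplotlib.pyplot as plt

from google.colab import drive
drive.mount('/content/drive')  #Mount Google Drive
MODEL_DIR = "/content/drive/My Drive/temp"
if not os.path.exists(MODEL_DIR):  #If the directory does not exist, create it.
    os.makedirs(MODEL_DIR)

#Data and model definition (same as the first run; required in a new session)
(Xtrain, ytrain), (Xtest, ytest) = mnist.load_data()
Xtrain = Xtrain.reshape(60000, 784).astype("float32") / 255
Xtest = Xtest.reshape(10000, 784).astype("float32") / 255
Ytrain = to_categorical(ytrain, 10)
Ytest = to_categorical(ytest, 10)
model = Sequential()
model.add(Dense(512, input_shape=(784,), activation="relu"))
model.add(Dense(512, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])

#Model loading
model.load_weights(os.path.join(MODEL_DIR, "model-05.h5"))  #Specify the checkpoint to resume from

#Save the second run under a new file name
checkpoint = ModelCheckpoint(
    filepath=os.path.join(MODEL_DIR, "model_new-{epoch:02d}.h5"),
    monitor='loss',
    save_best_only=True)  #Same setting as the first run

#Resume training
BATCH_SIZE = 128  #Assumed value; not given in the original
NUM_EPOCHS = 20   #Assumed value; not given in the original
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS, validation_split=0.1, callbacks=[checkpoint])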

#Graph drawing
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

plot_epochs = range(1, len(acc)+1)
# Accuracy
plt.plot(plot_epochs, acc, 'bo-', label='Training acc')
plt.plot(plot_epochs, val_acc, 'b', label='Validation acc')
plt.title('model accuracy')
plt.ylabel('accuracy')  #Y-axis label
plt.xlabel('epoch')  #X-axis label
plt.legend()
plt.show()  #Finish this figure so the loss plot is drawn separately

loss = history.history['loss']
val_loss = history.history['val_loss']

plot_epochs = range(1, len(loss)+1)
# Loss
plt.plot(plot_epochs, loss, 'ro-', label='Training loss')
plt.plot(plot_epochs, val_loss, 'r', label='Validation loss')
plt.title('model loss')
plt.ylabel('loss')  #Y-axis label
plt.xlabel('epoch')  #X-axis label
plt.legend()
plt.show()

6. Reference materials

  1. Google Colab Tips Organized
  2. [How to interrupt learning in Keras and then restart it](https://intellectual-curiosity.tokyo/2019/06/25/keras%e3%81%a7%e5%ad%a6%e7%bf%92%e3%82%92%e4%b8%ad%e6%96%ad%e3%81%97%e3%81%9f%e5%be%8c%e3%80%81%e9%80%94%e4%b8%ad%e3%81%8b%e3%82%89%e5%86%8d%e9%96%8b%e3%81%99%e3%82%8b%e6%96%b9%e6%b3%95/?unapproved=1126&moderation-hash=a1d80e5413be867d6179fd011c317d71#comment-1126)
  3. Save the best model (How to use ModelCheckpoint)

