Well, as the title says, I've been thinking about the fastest way to do data loading and data augmentation in a Kaggle notebook, so I'll introduce what I came up with. If you know a faster way, please let me know! The task for this comparison: read the JPEG images of the Kaggle Dog Breed Identification dataset, apply a typical augmentation pipeline, collate them into batches of 64, and move each batch to the GPU, measuring the time for one pass over the dataset.
You can also try the code used in this article in this notebook (I haven't debugged it much, so please point out any deficiencies): https://www.kaggle.com/hirune924/the-fastest-data-loading-data-augmentation?scriptVersionId=41763394
For those who are dying to know the outcome, here it is up front: the mean times per pass were roughly 71 s for OpenCV + Albumentations, 41.5 s for jpeg4py + Albumentations, 26.2 s for jpeg4py + Kornia, and 8.1 s for DALI + Kornia.
OpenCV + Albumentations
This is a pretty standard combination by now; I suspect more people use it than torchvision's transforms. I often start a Kaggle competition with this combination.
cv2_alb.py
import time
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import cv2
import albumentations as A

class DogDataset(Dataset):
    def __init__(self, transform=None):
        self.img_list = pd.read_csv('../input/dog-breed-identification/labels.csv')
        self.transform = transform
        breeds = list(self.img_list['breed'].unique())
        self.breed2idx = {b: i for i, b in enumerate(breeds)}

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        img_row = self.img_list.iloc[idx]
        image = cv2.imread('../input/dog-breed-identification/train/' + img_row['id'] + '.jpg')
        label = self.breed2idx[img_row['breed']]
        if self.transform is not None:
            image = self.transform(image=image)
        # HWC numpy array -> CHW tensor
        image = torch.from_numpy(image['image'].transpose(2, 0, 1))
        return image, label

transform = A.Compose([A.RandomResizedCrop(height=224, width=224, p=1),
                       A.HorizontalFlip(p=0.5),
                       A.VerticalFlip(p=0.5),
                       A.MotionBlur(blur_limit=3, p=1),
                       A.Rotate(limit=45, p=1),
                       A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), max_pixel_value=255.0, always_apply=True, p=1.0)])

data_loader = DataLoader(DogDataset(transform=transform), batch_size=64, shuffle=True, num_workers=2)
Iterate over the DataLoader as follows and measure the time.
cv2_alb_time.py
%%timeit -r 2 -n 5
opencv_alb_times = []
start_time = time.time()
for image, label in data_loader:
    image = image.cuda()
    label = label.cuda()
opencv_alb_time = time.time() - start_time
opencv_alb_times.append(opencv_alb_time)
print(str(opencv_alb_time) + ' sec')
The result is as follows.
98.37442970275879 sec
70.52895092964172 sec
66.72178149223328 sec
61.30395317077637 sec
68.30901885032654 sec
69.6796133518219 sec
71.02722263336182 sec
70.88462662696838 sec
70.54376363754272 sec
65.67756700515747 sec
1min 11s ± 1.74 s per loop (mean ± std. dev. of 2 runs, 5 loops each)
jpeg4py + Albumentations
Whenever I write this kind of article, someone tells me to try jpeg4py, so I'll measure it as a baseline as well. Indeed, if your images are JPEGs, there is no reason not to use it. First, the installation:
install_jpeg4py.sh
!apt-get install libturbojpeg
!pip install jpeg4py
Next is the code. It's almost the same as before except for how the image is read.
jpeg4py_alb.py
import time
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import albumentations as A
import jpeg4py as jpeg

class DogDataset(Dataset):
    def __init__(self, transform=None):
        self.img_list = pd.read_csv('../input/dog-breed-identification/labels.csv')
        self.transform = transform
        breeds = list(self.img_list['breed'].unique())
        self.breed2idx = {b: i for i, b in enumerate(breeds)}

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        img_row = self.img_list.iloc[idx]
        # jpeg4py decodes via libjpeg-turbo and returns an RGB numpy array
        image = jpeg.JPEG('../input/dog-breed-identification/train/' + img_row['id'] + '.jpg').decode()
        label = self.breed2idx[img_row['breed']]
        if self.transform is not None:
            image = self.transform(image=image)
        image = torch.from_numpy(image['image'].transpose(2, 0, 1))
        return image, label

transform = A.Compose([A.RandomResizedCrop(height=224, width=224, p=1),
                       A.HorizontalFlip(p=0.5),
                       A.VerticalFlip(p=0.5),
                       A.MotionBlur(blur_limit=3, p=1),
                       A.Rotate(limit=45, p=1),
                       A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), max_pixel_value=255.0, always_apply=True, p=1.0)])

data_loader = DataLoader(DogDataset(transform=transform), batch_size=64, shuffle=True, num_workers=2)
The code to read and measure the time is almost the same as the previous one.
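For completeness, here is a minimal sketch of that cell, assuming it simply mirrors the earlier one (the file name and variable names are guesses, following the same naming pattern as the other cells):
jpeg4py_alb_time.py
%%timeit -r 2 -n 5
jpeg4py_alb_times = []
start_time = time.time()
for image, label in data_loader:
    # same loop as before: just move each batch to the GPU
    image = image.cuda()
    label = label.cuda()
jpeg4py_alb_time = time.time() - start_time
jpeg4py_alb_times.append(jpeg4py_alb_time)
print(str(jpeg4py_alb_time) + ' sec')
The result is as follows. As expected, jpeg4py is fast.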
43.14848828315735 sec
42.78340029716492 sec
41.33797478675842 sec
43.24748754501343 sec
41.11549472808838 sec
41.17329430580139 sec
40.58435940742493 sec
41.16935634613037 sec
40.92542815208435 sec
39.6163330078125 sec
41.5 s ± 816 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
jpeg4py + Kornia
Next, use Kornia for the data augmentation. This makes it possible to run the augmentation on the GPU. Only the initial RandomResizedCrop still uses Albumentations, so that every image has the same shape and the batch can be collated into a single tensor. Since Kornia operates on whole batches, the augmentation is applied to each batch produced by the DataLoader.
The code is here.
jpeg4py_kornia.py
import time
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import jpeg4py as jpeg
import albumentations as A
import kornia.augmentation as K
import torch.nn as nn

class DogDataset(Dataset):
    def __init__(self, transform=None):
        self.img_list = pd.read_csv('../input/dog-breed-identification/labels.csv')
        self.transform = transform
        breeds = list(self.img_list['breed'].unique())
        self.breed2idx = {b: i for i, b in enumerate(breeds)}

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, idx):
        img_row = self.img_list.iloc[idx]
        image = jpeg.JPEG('../input/dog-breed-identification/train/' + img_row['id'] + '.jpg').decode()
        label = self.breed2idx[img_row['breed']]
        if self.transform is not None:
            image = self.transform(image=image)
        # Kornia works on float tensors, so cast here
        image = torch.from_numpy(image['image'].transpose(2, 0, 1).astype(np.float32))
        return image, label

# only the crop runs on the CPU, so every image in a batch has the same shape
alb_transform = A.Compose([A.RandomResizedCrop(height=224, width=224, p=1)])

# pixel values are still in the 0-255 range, so scale mean/std by 255
mean_std = torch.Tensor([0.5, 0.5, 0.5]) * 255
kornia_transform = nn.Sequential(
    K.RandomHorizontalFlip(),
    K.RandomVerticalFlip(),
    K.RandomMotionBlur(3, 35., 0.5),
    K.RandomRotation(degrees=45.0),
    K.Normalize(mean=mean_std, std=mean_std)
)

data_loader = DataLoader(DogDataset(transform=alb_transform), batch_size=64, shuffle=True, num_workers=2)
The measurement loop is below. You can see that the transform is applied to the whole batch after it is loaded from the DataLoader.
jpeg4py_kornia_time.py
%%timeit -r 2 -n 5
jpeg4py_kornia_times = []
start_time = time.time()
for image, label in data_loader:
    image = kornia_transform(image.cuda())
    label = label.cuda()
jpeg4py_kornia_time = time.time() - start_time
jpeg4py_kornia_times.append(jpeg4py_kornia_time)
print(str(jpeg4py_kornia_time) + ' sec')
The result is as follows. It's getting pretty fast.
28.150899171829224 sec
24.104888916015625 sec
25.490058183670044 sec
24.111201763153076 sec
22.999730587005615 sec
25.16165590286255 sec
26.496272325515747 sec
27.150801420211792 sec
28.757362365722656 sec
29.331339836120605 sec
26.2 s ± 1.2 s per loop (mean ± std. dev. of 2 runs, 5 loops each)
DALI + Kornia
This is the fastest combination I can think of right now; wanting to share it is why I wrote this article. With DALI, the GPU is used from the image-decoding stage onward, so by the time an image is loaded it is already on the GPU. That makes Albumentations (which works on CPU-side NumPy arrays) hard to combine with it, and DALI itself implements only a few kinds of augmentation, but this combination became viable because GPU-side augmentation with Kornia has reached a practical level. First, install NVIDIA DALI.
install_dali.sh
!pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
And here is the code. It has a somewhat different flavor from the previous examples: with DALI you define a pipeline, build it, and create an iterator that returns PyTorch tensors. RandomResizedCrop is done inside DALI, and everything after that in Kornia.
dali_kornia.py
import time
import torch
import pandas as pd
import kornia.augmentation as K
import torch.nn as nn
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIGenericIterator
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class DALIPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(DALIPipeline, self).__init__(batch_size, num_threads, device_id)
        self.img_list = pd.read_csv('../input/dog-breed-identification/labels.csv')
        breeds = list(self.img_list['breed'].unique())
        self.breed2idx = {b: i for i, b in enumerate(breeds)}
        self.img_list['label'] = self.img_list['breed'].map(self.breed2idx)
        self.img_list['data'] = '../input/dog-breed-identification/train/' + self.img_list['id'] + '.jpg'
        # DALI's FileReader takes a text file with "path label" pairs per line
        self.img_list[['data', 'label']].to_csv('dali.txt', header=False, index=False, sep=' ')

        self.input = ops.FileReader(file_root='.', file_list='dali.txt')
        # device="mixed" decodes JPEGs partly on the GPU; the output lives on the GPU
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.DALIImageType.RGB)
        #self.decode = ops.ImageDecoderRandomCrop(device="mixed", output_type=types.DALIImageType.RGB)
        self.resize = ops.RandomResizedCrop(device="gpu", size=(224, 224))
        self.transpose = ops.Transpose(device='gpu', perm=[2, 0, 1])
        self.cast = ops.Cast(device='gpu', dtype=types.DALIDataType.FLOAT)

    def define_graph(self):
        images, labels = self.input(name="Reader")
        images = self.decode(images)
        images = self.resize(images)
        images = self.cast(images)
        output = self.transpose(images)
        return (output, labels)

def DALIDataLoader(batch_size):
    num_gpus = 1
    pipes = [DALIPipeline(batch_size=batch_size, num_threads=2, device_id=device_id) for device_id in range(num_gpus)]
    pipes[0].build()
    dali_iter = DALIGenericIterator(pipelines=pipes, output_map=['data', 'label'],
                                    size=pipes[0].epoch_size("Reader"), reader_name=None,
                                    auto_reset=True, fill_last_batch=True, dynamic_shape=False,
                                    last_batch_padded=True)
    return dali_iter

data_loader = DALIDataLoader(batch_size=64)

mean_std = torch.Tensor([0.5, 0.5, 0.5]) * 255
kornia_transform = nn.Sequential(
    K.RandomHorizontalFlip(),
    K.RandomVerticalFlip(),
    K.RandomMotionBlur(3, 35., 0.5),
    K.RandomRotation(degrees=45.0),
    K.Normalize(mean=mean_std, std=mean_std)
)
The measurement loop is as follows.
dali_kornia_time.py
%%timeit -r 2 -n 5
dali_kornia_times = []
start_time = time.time()
for feed in data_loader:
    # the images are already on the GPU, so only the labels need .cuda()
    image = kornia_transform(feed[0]['data'])
    label = feed[0]['label'].cuda()
dali_kornia_time = time.time() - start_time
dali_kornia_times.append(dali_kornia_time)
print(str(dali_kornia_time) + ' sec')
The result is as follows. It's explosively fast! Almost illegally fast!
8.865531921386719 sec
7.996037721633911 sec
8.494542598724365 sec
8.241464853286743 sec
8.093241214752197 sec
8.12808108329773 sec
7.846079587936401 sec
7.849750280380249 sec
7.848227024078369 sec
7.633721828460693 sec
8.1 s ± 238 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
I also made a quick bar chart of the results.
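The chart itself is not reproduced here, but as a rough sketch, a matplotlib snippet like the following (hypothetical, not from the original notebook) would recreate it from the mean times measured above:
plot_results.py
import matplotlib.pyplot as plt

# mean seconds per pass, taken from the %%timeit summaries above
methods = ['OpenCV\n+Albumentations', 'jpeg4py\n+Albumentations', 'jpeg4py\n+Kornia', 'DALI\n+Kornia']
means = [71.0, 41.5, 26.2, 8.1]

plt.figure(figsize=(8, 4))
plt.bar(methods, means)
plt.ylabel('seconds per pass (mean)')
plt.title('Data loading + augmentation time')
plt.show()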
This time, I introduced a method for blazing-fast data loading and data augmentation using NVIDIA DALI and Kornia. Kornia's augmentation suite still falls short of Albumentations in coverage, but the project's roadmap issue explicitly mentions Albumentations, so I'm looking forward to what comes next! https://github.com/kornia/kornia/issues/434