[PYTHON] PyTorch Sangokushi (Ignite / Catalyst / Lightning)

0. Introduction

Deep learning frameworks are an area where development moves very fast, which makes them exciting to follow. There are TensorFlow, JAX, and others, but with the recent news from PFN (migrating from Chainer to PyTorch), PyTorch's position looks increasingly solid. PyTorch users will probably keep increasing (I won't dwell on the **official Trainer** that Chainer had, since if PyTorch had one it would jeopardize this article's reason to exist).

However, while PyTorch offers a high degree of freedom, the code around training (the per-epoch loop and so on) is left to each user, so it tends to become very idiosyncratic.

Writing this code yourself teaches you a lot, and I think it is a must when getting started with PyTorch. However, if the code becomes too idiosyncratic, it gets hard to share with other people or to reuse across competitions (e.g. the home-grown "my own Trainer" classes often seen in winner solutions).

PyTorch itself has no built-in framework for simplifying training code (there used to be a Trainer, but it was removed). Instead, [Ecosystem | PyTorch](https://pytorch.org/ecosystem/) introduces the following frameworks built on top of PyTorch.

There are quite a few. Honestly, the best approach is to try them all and find the one that suits you, but that is no small effort. So in this article I introduce each framework and make a simple comparison, which I hope will serve as a guide before you try them yourself.

Note that **fastai** offers a high level of abstraction (code tends to be short), but I felt the learning cost of adding fine-grained custom behavior was high, so **it is omitted from the comparison in this article**.

This article therefore focuses on **Catalyst**, **Ignite**, and **Lightning**, comparing them on the assumption that you will take part in **Kaggle competitions**.

I have also verified in advance that all three frameworks work to a reasonable extent. If you read this article and want to go a little further, please refer to the code below.

- Code for this article: https://github.com/yukkyo/Compare-PyTorch-Catalyst-Ignite-Lightning

As mentioned above, **all three are viable choices for competitions.**

1. Who this article is (and is not) for

- You are interested in Kaggle competitions
- You are interested in image competitions
  - Especially Classification, Segmentation, and Detection
- You have touched PyTorch before
  - If you have not, this article can wait; start with the following pages and books
    - Welcome to PyTorch Tutorials — PyTorch Tutorials 1.3.1 documentation
    - Learn while making! Advanced deep learning with PyTorch | Yutaro Ogawa | Kindle Store | Amazon

2. Comparison of each framework (Catalyst, Ignite, Lightning)

I used the latest versions available on pip as of December 13, 2019, with Python 3.7.5. ~~It also assumes NVIDIA/apex is installed.~~ apex is not needed to run this code.

torch==1.3.1
torchvision==0.4.2
catalyst==19.12
pytorch-ignite==0.2.1
pytorch-lightning==0.5.3.2

2.1 GitHub star trends (as of December 10, 2019)

Catalyst and Ignite are growing steadily, while Lightning has grown explosively since April of this year. On the other hand, Lightning is less than a year old, so many features are still under development, and note that it is still unstable (version upgrades are not always backward compatible, etc.).

It's also possible that Catalyst will overtake Ignite, as we often see Catalyst on Kaggle Notebooks these days.

2.2 How to write

Let's see what happens when we apply each framework to plain PyTorch training code.

2.2.1 Common part

This time we train ResNet-18 on the CIFAR-10 dataset. As shown in the code below, the model and DataLoader definitions are factored out into functions in advance.

Common code (folded because it is long)

share_funcs.py


import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def get_criterion():
    """A function that returns Loss nicely"""
    return nn.CrossEntropyLoss()

def get_loaders(batch_size: int = 16, num_workers: int = 4):
    """A function that can return each Dataloader nicely"""
    transform = transforms.Compose([transforms.ToTensor()])

    # Dataset
    args_dataset = dict(root='./data', download=True, transform=transform)
    trainset = datasets.CIFAR10(train=True, **args_dataset)
    testset = datasets.CIFAR10(train=False, **args_dataset)

    # Data Loader
    args_loader = dict(batch_size=batch_size, num_workers=num_workers)

    train_loader = DataLoader(trainset, shuffle=True, **args_loader)
    val_loader = DataLoader(testset, shuffle=False, **args_loader)
    return train_loader, val_loader

def get_model(num_class: int = 10):
    """A function that returns the model nicely"""
    model = models.resnet18(pretrained=True)
    num_features = model.fc.in_features
    model.fc = nn.Linear(num_features, num_class)
    return model

def get_optimizer(model: torch.nn.Module, init_lr: float = 1e-3, epoch: int = 10):
    optimizer = optim.SGD(model.parameters(), lr=init_lr, momentum=0.9)
    lr_scheduler = optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[int(epoch*0.8), int(epoch*0.9)],
        gamma=0.1
    )
    return optimizer, lr_scheduler

2.2.2 Base code (plain training code)

If you write it straightforwardly without overthinking, it looks like the following. `.to(device)`, `loss.backward()`, and `optimizer.step()` all have to be written explicitly, so the code tends to get long. Also, `with torch.no_grad()` can be replaced by `torch.set_grad_enabled(bool)` so that one function serves both train and eval, but since so much differs between the two phases (e.g. `optimizer.step()`, metrics), a single function handling both tends to hurt readability.
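
For reference, here is a minimal sketch of what such a combined function could look like (the run_epoch helper is my own illustration, not part of the article's repository). The is_train branches it needs show exactly where the readability trade-off comes from:

import torch

def run_epoch(model, data_loader, criterion, device, optimizer=None):
    """Train for one epoch if an optimizer is given, otherwise evaluate."""
    is_train = optimizer is not None
    model.train(is_train)

    total_loss = 0.
    # set_grad_enabled(bool) lets one loop serve both phases
    with torch.set_grad_enabled(is_train):
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            if is_train:  # branches like this pile up quickly
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            total_loss += loss.item()
    return total_loss / len(data_loader)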

Base code (folded because it's long)
import torch
from tqdm import tqdm

from share_funcs import get_model, get_loaders, get_criterion, get_optimizer

def train(model, data_loader, criterion, optimizer, device, grad_acc=1):
    model.train()

    # zero the parameter gradients
    optimizer.zero_grad()

    total_loss = 0.
    for i, (inputs, labels) in tqdm(enumerate(data_loader), total=len(data_loader)):
        inputs = inputs.to(device)
        labels = labels.to(device)

        outputs = model(inputs)

        loss = criterion(outputs, labels)
        loss.backward()

        # Gradient accumulation
        if (i % grad_acc) == 0:
            optimizer.step()
            optimizer.zero_grad()

        total_loss += loss.item()

    total_loss /= len(data_loader)
    metrics = {'train_loss': total_loss}
    return metrics


def eval(model, data_loader, criterion, device):
    model.eval()
    num_correct = 0.

    with torch.no_grad():
        total_loss = 0.
        for inputs, labels in tqdm(data_loader, total=len(data_loader)):
            inputs = inputs.to(device)
            labels = labels.to(device)

            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)

            loss = criterion(outputs, labels)

            total_loss += loss.item()
            num_correct += torch.sum(preds == labels).item()  # .item() casts to float; integer-tensor division would truncate val_acc to 0

        total_loss /= len(data_loader)
        num_correct /= len(data_loader.dataset)
        metrics = {'valid_loss': total_loss, 'val_acc': num_correct}
    return metrics


def main():
    epochs = 10

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = get_model()
    train_loader, val_loader = get_loaders()
    optimizer, lr_scheduler = get_optimizer(model=model)
    criterion = get_criterion()

    # Move the model to the device (multi-GPU / FP16 setup would also go here)
    model = model.to(device)

    print('Train start !')
    for epoch in range(epochs):
        print(f'epoch {epoch} start !')
        metrics_train = train(model, train_loader, criterion, optimizer, device)
        metrics_eval = eval(model, val_loader, criterion, device)

        lr_scheduler.step()

        # Logger handling
        # (printing quickly gets messy)
        print(f'epoch: {epoch} ', metrics_train, metrics_eval)

        # With tqdm, printing here would get even messier
        # Model-saving logic also goes here
        # (and needs extra care when using multiple GPUs)
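
One common way to fill in the model-saving comment above is the following sketch (my own illustration, plain PyTorch): unwrap nn.DataParallel before saving so the checkpoint also loads into a single-GPU model.

import torch
import torch.nn as nn

def save_checkpoint(model, path):
    # With nn.DataParallel the weights live under model.module;
    # unwrap so the file loads into a non-parallel model as well.
    to_save = model.module if isinstance(model, nn.DataParallel) else model
    torch.save(to_save.state_dict(), path)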

2.2.3 Catalyst

With Catalyst, you just pass what you need to the library's `SupervisedRunner` and you're done. Very clean! Major metrics such as Accuracy and Dice are built in, so you rarely have to write them yourself (adding custom metrics also looked relatively easy). The defaults cover the usual boilerplate nicely, but if you want to add fine-grained custom processing you will need to dig into the internals a little.

import catalyst
from catalyst.dl import SupervisedRunner
from catalyst.dl.callbacks import AccuracyCallback
from share_funcs import get_model, get_loaders, get_criterion, get_optimizer

def main():
    epochs = 5
    num_class = 10
    output_path = './output/catalyst'

    model = get_model()
    train_loader, val_loader = get_loaders()
    loaders = {"train": train_loader, "valid": val_loader}

    optimizer, lr_scheduler = get_optimizer(model=model)
    criterion = get_criterion()

    runner = SupervisedRunner(device=catalyst.utils.get_device())
    runner.train(
        model=model,
        criterion=criterion,
        optimizer=optimizer,
        scheduler=lr_scheduler,
        loaders=loaders,
        logdir=output_path,
        callbacks=[AccuracyCallback(num_classes=num_class, accuracy_args=[1])],
        num_epochs=epochs,
        main_metric="accuracy01",
        minimize_metric=False,
        fp16=None,
        verbose=True
    )
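
Catalyst writes checkpoints under logdir/checkpoints (see section 2.3.2 below). As a minimal loading sketch for inference: the file path and the 'model_state_dict' key are my assumptions about the catalyst==19.12 checkpoint layout, so inspect checkpoint.keys() if yours differs.

import torch
from share_funcs import get_model

model = get_model()
checkpoint = torch.load('./output/catalyst/checkpoints/best.pth', map_location='cpu')
# Assumed key for catalyst==19.12; falls back to treating the file as a raw state_dict
state_dict = checkpoint.get('model_state_dict', checkpoint)
model.load_state_dict(state_dict)
model.eval()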

2.2.4 Ignite

Ignite has a slightly different flavor from Catalyst and Lightning (described later). As shown below, the idea is to hook the processing you want in at each timing with decorators such as `@trainer.on(Events.EPOCH_COMPLETED)`. Ignite also ships metrics such as Accuracy officially, so for major evaluation metrics you should not have to define them yourself.

On the other hand, it takes some getting used to, and because there is so much freedom in how events are attached (you can also add handlers with `trainer.add_event_handler`), a careless structure can hurt the overall readability.

import torch
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss, RunningAverage
from ignite.contrib.handlers import ProgressBar
from share_funcs import get_model, get_loaders, get_criterion, get_optimizer

def run(epochs, model, criterion, optimizer, scheduler,
        train_loader, val_loader, device):
    trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
    evaluator = create_supervised_evaluator(
        model,
        metrics={'accuracy': Accuracy(), 'nll': Loss(criterion)},
        device=device
    )

    RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')

    pbar = ProgressBar(persist=True)
    pbar.attach(trainer, metric_names='all')

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_training_results(engine):
        scheduler.step()
        evaluator.run(train_loader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics['accuracy']
        avg_nll = metrics['nll']
        pbar.log_message(
            "Training Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
            .format(engine.state.epoch, avg_accuracy, avg_nll)
        )

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(engine):
        evaluator.run(val_loader)
        metrics = evaluator.state.metrics
        avg_accuracy = metrics['accuracy']
        avg_nll = metrics['nll']
        pbar.log_message(
            "Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
            .format(engine.state.epoch, avg_accuracy, avg_nll))

        pbar.n = pbar.last_print_n = 0

    trainer.run(train_loader, max_epochs=epochs)

def main():
    epochs = 10
    train_loader, val_loader = get_loaders()
    model = get_model()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    optimizer, scheduler = get_optimizer(model)
    criterion = get_criterion()

    run(
        epochs=epochs,
        model=model,
        criterion=criterion,
        optimizer=optimizer,
        scheduler=scheduler,
        train_loader=train_loader,
        val_loader=val_loader,
        device=device
    )

2.2.5 Lightning

With Lightning, you define a class that inherits from `LightningModule` (think of it as a Trainer-facing wrapper around your model).

The name of each step (e.g. `training_step`) is fixed, and you fill in each step yourself. The training itself is run by the `pytorch_lightning.Trainer` class, which is also where settings such as GPUs, mixed precision, and gradient accumulation go. Note that metrics are not built into Lightning, so you have to write them yourself.

import torch
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from share_funcs import get_model, get_loaders, get_criterion, get_optimizer

class MyLightningModule(pl.LightningModule):
    def __init__(self, num_class):
        super().__init__()
        self.model = get_model(num_class=num_class)
        self.criterion = get_criterion()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = self.criterion(y_hat, y)
        logs = {'train_loss': loss}
        return {'loss': loss, 'log': logs, 'progress_bar': logs}

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        preds = torch.argmax(y_hat, dim=1)
        return {'val_loss': self.criterion(y_hat, y), 'correct': (preds == y).float()}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        acc = torch.cat([x['correct'] for x in outputs]).mean()
        logs = {'val_loss': avg_loss, 'val_acc': acc}
        return {'avg_val_loss': avg_loss, 'log': logs}

    def configure_optimizers(self):
        # REQUIRED
        optimizer, scheduler = get_optimizer(model=self.model)
        return [optimizer], [scheduler]

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return get_loaders()[0]

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return get_loaders()[1]


def main():
    epochs = 5
    num_class = 10
    output_path = './output/lightning'

    model = MyLightningModule(num_class=num_class)

    # most basic trainer, uses good defaults
    trainer = Trainer(
        max_nb_epochs=epochs,
        default_save_path=output_path,
        gpus=[0],
        # use_amp=False,
    )
    trainer.fit(model)
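
Since the Trainer is where the GPU, mixed-precision, and gradient-accumulation settings mentioned above live, here is a sketch of those flags as I understand the 0.5.x API (accumulate_grad_batches is my reading of the docs for this version, and several arguments were later renamed, e.g. max_nb_epochs became max_epochs):

from pytorch_lightning import Trainer

trainer = Trainer(
    max_nb_epochs=5,
    default_save_path='./output/lightning',
    gpus=[0],                    # list of GPU indices to train on
    use_amp=False,               # mixed precision via NVIDIA/apex
    accumulate_grad_batches=2,   # step the optimizer every 2 batches
)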

2.3 Console screen and output when executed by each framework

2.3.1 Default

Console screen

$ python train_default.py
Files already downloaded and verified
Files already downloaded and verified
Train start !
epoch 0 start !
100%|_____| 196/196 [00:05<00:00, 33.44it/s]
100%|_____| 40/40 [00:00<00:00, 50.43it/s]
epoch: 0  {'train_loss': 1.3714478426441854} {'valid_loss': 0.992230711877346, 'val_acc': tensor(0, device='cuda:0')}

Output

None

2.3.2 Catalyst

Console screen

$ python train_catalyst.py
1/5 * Epoch (train): 100% 196/196 [00:06<00:00, 30.09it/s, accuracy01=61.250, loss=1.058]
1/5 * Epoch (valid): 100% 40/40 [00:00<00:00, 49.75it/s, accuracy01=56.250, loss=1.053]
[2019-12-14 08:47:33,819]
1/5 * Epoch 1 (train): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=58330.0450 | _timers/batch_time=0.0071 | _timers/data_time=0.0045 | _timers/model_time=0.0026 | accuracy01=52.0863 | loss=1.3634
1/5 * Epoch 1 (valid): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=77983.3850 | _timers/batch_time=0.0146 | _timers/data_time=0.0126 | _timers/model_time=0.0019 | accuracy01=65.6250 | loss=0.9848
2/5 * Epoch (train): 100% 196/196 [00:06<00:00, 30.28it/s, accuracy01=63.750, loss=0.951]

Output

- TensorBoard logs are output by default
- Weights are also saved
- Saving a copy of the code by default is a nice touch

catalyst
├── checkpoints
│   └── train.1.exception_KeyboardInterrupt.pth
├── code
│   ├── share_funcs.py
│   ├── train_catalyst.py
│   ├── train_default.py
│   └── train_lightning.py
├── log.txt
└── train_log
    └── events.out.tfevents.1576306176.FujimotoMac.local.41575.0

2.3.3 Ignite

Console screen

The screen is a little cleaner than Catalyst's.

$ python train_ignite.py
Epoch [1/10]: [196/196] 100%|________________, loss=1.14 [00:05<00:00]
Training Results - Epoch: 1  Avg accuracy: 0.69 Avg loss: 0.88
Validation Results - Epoch: 1  Avg accuracy: 0.65 Avg loss: 0.98
Epoch [2/10]: [196/196] 100%|________________, loss=0.813 [00:05<00:00]
Training Results - Epoch: 2  Avg accuracy: 0.78 Avg loss: 0.65
Validation Results - Epoch: 2  Avg accuracy: 0.70 Avg loss: 0.83

Output

- None
- It seems you either write the saving code yourself or use the handler class Ignite provides (see the sketch below)
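
A sketch of the latter option, plugged into the run() function from section 2.2.4; the arguments follow the pytorch-ignite==0.2.1 API as I understand it (save_interval was reworked in later releases), and the output directory is just a placeholder:

from ignite.engine import Events
from ignite.handlers import ModelCheckpoint

# Save the model every epoch, keeping the two most recent files
checkpointer = ModelCheckpoint('./output/ignite', 'cifar10',
                               save_interval=1, n_saved=2, create_dir=True)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'model': model})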

2.3.4 Lightning

Console screen

By default, Lightning seems to pack everything into the tqdm progress bar.

$ python train_lightning.py
Epoch 1: 100%|_____________| 236/236 [00:07<00:00, 30.75batch/s, batch_nb=195, gpu=0, loss=1.101, train_loss=1.06, v_nb=5]
Epoch 4:  41%|_____________| 96/236 [00:03<00:04, 32.28batch/s, batch_nb=95, gpu=0, loss=0.535, train_loss=0.524, v_nb=5]

Output

- If output directories would collide, Lightning creates and saves into versioned directories like version_X (this can be annoying, and you may prefer to define your own checkpoint callback)
- The parameters passed to the LightningModule are automatically saved to meta_tags.csv
- TensorBoard logs are also created by default
- Weights are stored under checkpoints
- By default a _ckpt_epoch_X.ckpt is created each epoch, and ckpts from older epochs appear to be removed

lightning
└── lightning_logs
    ├── version_0
    │   ├── checkpoints
    │   │   └── _ckpt_epoch_4.ckpt
    │   ├── media
    │   ├── meta.experiment
    │   ├── meta_tags.csv
    │   ├── metrics.csv
    │   └── tf
    │       └── events.out.tfevents.1576305970
    ├── version_1
    │   ├── checkpoints
    │   │   └── _ckpt_epoch_3.ckpt
    │   ├── ...
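
To reuse the stored weights outside Lightning, plain torch.load works on the ckpt file; the 'state_dict' key and the path below are my assumptions about the 0.5.x checkpoint format, so inspect ckpt.keys() if it differs:

import torch
from train_lightning import MyLightningModule  # the module defined in section 2.2.5

model = MyLightningModule(num_class=10)
ckpt = torch.load(
    './output/lightning/lightning_logs/version_0/checkpoints/_ckpt_epoch_4.ckpt',
    map_location='cpu')
model.load_state_dict(ckpt['state_dict'])  # assumed key; verify with ckpt.keys()
model.eval()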

2.4 Other notable points

All three support Early Stopping and similar conveniences; one example is sketched below.
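
This sketch uses Ignite's EarlyStopping handler attached to the evaluator from section 2.2.4 (Catalyst and Lightning expose the same idea as a callback and a Trainer argument, respectively):

from ignite.engine import Events
from ignite.handlers import EarlyStopping

def score_function(engine):
    # EarlyStopping maximizes the score, so negate the validation loss
    return -engine.state.metrics['nll']

# trainer and evaluator are the ones built in section 2.2.4
stopper = EarlyStopping(patience=3, score_function=score_function, trainer=trainer)
evaluator.add_event_handler(Events.COMPLETED, stopper)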

2.4.1 Catalyst

- Utilities for reproducibility such as catalyst.utils.set_global_seed() are provided
- Helper functions that simplify writing Datasets are also available
  - create_dataset, create_dataframe, prepare_dataset_labeling, split_dataframe
  - catalyst.utils.pandas
- Multi-GPU and FP16 are supported
- The official tutorials are high quality
- There is an official Dockerfile, so the framework seems to also have infrastructure and configuration management in mind (probably)

2.4.2 Ignite

- TensorBoard and other loggers ship with Ignite and can be called directly
- It seems to offer the highest degree of freedom
- You do need to get used to how events are hooked in
- It lives under the official PyTorch GitHub organization

2.4.3 Lightning

- Multi-GPU and FP16 are supported

2.5 Which one would I recommend?

All of them have potential, and none of this is prescriptive. Below are my personal impressions.

- It's my first image competition and I'm not sure what to do
  - → **PyTorch Catalyst**
- I'm used to image competitions and want to enter with killer intent (a strong will to take a gold medal)
  - → **PyTorch Lightning** or **PyTorch Catalyst**
- I want to work on a variety of image tasks, not just Classification/Segmentation
  - → **PyTorch Lightning** or **PyTorch Catalyst**
  - Catalyst even has sample code for reinforcement learning
- I want to write stylish code
  - → **PyTorch Ignite**
- I want the peace of mind of staying close to official PyTorch
  - → **PyTorch Ignite**
  - Ignite lives under the official PyTorch organization

3. Writing code that moves freely between Catalyst, Ignite, and Lightning

You can narrow down which framework you use, but **it should not be a problem as long as you write your code so that switching between frameworks is easy.** So here is a summary of things worth keeping in mind when writing PyTorch code.

- Pull work out of the loop bodies as much as possible
  - Aim to extract the processing of each step (e.g. the per-batch processing in train) into its own function
  - At the latest, once you find yourself writing a triply nested loop, consider whether the loop bodies can be extracted
- Don't let function argument lists grow too long
  - Use instance variables on a class, or bundle everything into a single Config
- Write functions that build the optimizer, the model, and so on
  - This one is a personal preference
- Consolidate configuration in one place (a minimal sketch follows this list)
  - When participating in a competition, keeping all settings in one place makes them much easier to manage
  - Examples:
    - Create a Config class
    - Use something dictionary-like that is easy to access, such as Addict
    - Read settings from a YAML file
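
As a minimal sketch of the "one Config" idea (the field names and the YAML layout are placeholders of mine, not from the article's repository):

from dataclasses import dataclass

import yaml  # pip install pyyaml

@dataclass
class Config:
    """Keep every experiment knob in one place."""
    batch_size: int = 16
    num_workers: int = 4
    init_lr: float = 1e-3
    epochs: int = 10
    output_path: str = './output'

def load_config(path: str) -> Config:
    """Override the defaults with values from a YAML file."""
    with open(path) as f:
        overrides = yaml.safe_load(f) or {}
    return Config(**overrides)

cfg = load_config('config.yml')  # e.g. a file containing "epochs: 5"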

4. Finally

In this article we compared PyTorch Catalyst, Ignite, and Lightning. They all share the desire to eliminate boilerplate, but each has its own personality. Every framework has potential, so if one feels right to you, take it to a competition and use it.

Have a good Kaggle (with PyTorch) life!
