[PYTHON] [PyTorch Tutorial ⑧] Torch Vision Object Detection Finetuning Tutorial

Introduction

This is the 8th installment in my walkthrough of the official PyTorch tutorials, following on from the previous article. This time, we will work through the TorchVision Object Detection Finetuning Tutorial.

TorchVision Object Detection Finetuning Tutorial

In this tutorial, we use a pre-trained Mask R-CNN model to look at fine tuning and transfer learning. The training data is the Penn-Fudan dataset for pedestrian detection and segmentation, which contains 170 images with 345 pedestrians (instances).

First, you need to install the pycocotools library. It is used here to compute evaluation metrics based on "Intersection over Union" (IoU), one of the standard ways to measure how well predicted regions overlap the ground truth in object detection.

%%shell

pip install cython
# Install pycocotools. The version installed by default in Colab has a bug that was fixed in https://github.com/cocodataset/cocoapi/pull/354
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
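As a rough illustration of IoU (this is not part of the tutorial, and the actual evaluation is done by pycocotools), the IoU of two axis-aligned boxes given as [xmin, ymin, xmax, ymax] can be computed like this:

def box_iou(a, b):
    # intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union = area(a) + area(b) - intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 = about 0.143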

Defining the Dataset

Let's define the dataset. To take advantage of the pre-trained Mask R-CNN model, the dataset has to return data in a specific format. If you use the torchvision reference scripts (the object detection, instance segmentation and person keypoint detection library), creating such a dataset is straightforward.

The dataset's __getitem__ should return the image (a PIL Image) and a target dictionary with the following fields:

boxes: the bounding box of each instance, as [xmin, ymin, xmax, ymax]
labels: the class label of each instance
masks: the segmentation mask of each instance
image_id: an identifier for the image
area: the area of each bounding box
iscrowd: whether the instance is a crowd region (ignored during evaluation)

(Roughly speaking, boxes describes a rectangle enclosing each object, and masks marks, pixel by pixel, whether a pixel belongs to the object.) If the dataset returns data in this format, the model works for both training and evaluation, and the evaluation scripts from pycocotools can be used.
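For example, for an image that contains a single pedestrian, a target could look roughly like this (a sketch only; the box and area values are taken from the first sample shown later, and the mask here is just a dummy placeholder):

import torch

target = {
    "boxes": torch.tensor([[159., 181., 301., 430.]]),      # [N, 4] as (xmin, ymin, xmax, ymax)
    "labels": torch.ones((1,), dtype=torch.int64),           # class 1 = person
    "masks": torch.zeros((1, 536, 559), dtype=torch.uint8),  # [N, H, W] binary masks (dummy)
    "image_id": torch.tensor([0]),
    "area": torch.tensor([35358.]),
    "iscrowd": torch.zeros((1,), dtype=torch.int64),
}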

Writing a custom dataset for Penn-Fudan

Let's write the dataset class for the Penn-Fudan dataset. First, download and unzip the zip file from https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip.

%%shell

# download the Penn-Fudan dataset
wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
# extract it in the current folder
unzip PennFudanPed.zip

The data has the following structure.

PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    ...
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png

Let's display the first image.

from PIL import Image
Image.open('PennFudanPed/PNGImages/FudanPed00001.png')

(Image: FudanPed00001.png)

(As described in the readme.txt included in the zip, each mask image uses 0 for the background and a label of 1 or higher for each pedestrian.)

mask = Image.open('PennFudanPed/PedMasks/FudanPed00001_mask.png')
# each mask instance has a different value, from zero to N,
# where N is the number of instances (pedestrians).
# To make visualization easier, let's add a color palette to the mask.
mask.putpalette([
    0, 0, 0, # black background
    255, 0, 0, # index 1 is red
    255, 255, 0, # index 2 is yellow
    255, 153, 0, # index 3 is orange
])
mask

(Image: the mask with the color palette applied)
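Although it is not in the tutorial, the raw label values in the mask can be checked with np.unique. This image contains two pedestrians, so the values should be 0 (background), 1 and 2:

import numpy as np

# label values present in the mask: 0 is the background, 1..N are pedestrians
print(np.unique(np.array(mask)))   # expected: [0 1 2]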

So each image has a corresponding mask image that identifies the pedestrians, and each color (value) of the mask corresponds to an individual pedestrian. Let's create a torch.utils.data.Dataset class for this dataset.

import os
import numpy as np
import torch
import torch.utils.data
from PIL import Image


class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        #Load and sort all image files
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        #Load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        #Because each color corresponds to a different instance and 0 is the background
        #Note that we have not converted the mask to RGB
        mask = Image.open(mask_path)

        mask = np.array(mask)
        #Instances are encoded as different colors
        obj_ids = np.unique(mask)
        #The first ID is the background, so delete it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

That's it for the dataset. Let's see how its output is organized.

dataset = PennFudanDataset('PennFudanPed/')
dataset[0]

out


(<PIL.Image.Image image mode=RGB size=559x536 at 0x7FC7AC4B62E8>,
 {'area': tensor([35358., 36225.]), 'boxes': tensor([[159., 181., 301., 430.],
          [419., 170., 534., 485.]]), 'image_id': tensor([0]), 'iscrowd': tensor([0, 0]), 'labels': tensor([1, 1]), 'masks': tensor([[[0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0],
           ...,
           [0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0]],
  
          [[0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0],
           ...,
           [0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0],
           [0, 0, 0,  ..., 0, 0, 0]]], dtype=torch.uint8)})

You can see that the dataset returns PIL.Image and a dictionary containing some fields such as boxes, labels, masks.

Although it is not in the tutorial, the following code visualizes the boxes and masks: boxes are rectangles enclosing the instances (people), and masks mark the pixels of the instances themselves.

import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()

target = dataset[0][1]

#Masks of the first instance
masks_0 = target['masks'][0,:,:]

#Boxes of the first instance
boxes_0 = target['boxes'][0]

#Output mask
ax.imshow(masks_0)

#Output boxes
ax.add_patch(
    patches.Rectangle(
        (boxes_0[0], boxes_0[1]),
        boxes_0[2] - boxes_0[0],
        boxes_0[3] - boxes_0[1],
        edgecolor='blue',
        facecolor='red',
        fill=True,
        alpha=0.5
    )
)

plt.show()

(Image: the first instance's mask with its bounding box overlaid)

Defining your model

This tutorial uses Mask R-CNN, which is based on Faster R-CNN. Faster R-CNN is an object detection model that predicts both the bounding box and the class score of each potential object in an image (a rectangle containing the object, and what the object is). (The image below shows an example of Faster R-CNN output.)

Faster R-CNN

Mask R-CNN extends Faster R-CNN: in addition to detecting each object with a rectangle (box), it also predicts which pixels belong to it (mask). (The image below shows an example of Mask R-CNN output.)

Mask R-CNN

There are two common reasons for customizing a torchvision model. The first is when you want to start from a pre-trained model and only replace the last layer. The other is when you want to replace the model's backbone with a different one (for example, to get faster predictions). Let's look at a concrete example of each.

1. Finetuning from a pretrained model

Here's how to use a pre-trained model to fine-tune to the class you want to identify.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# replace the classifier with a new one that has a user-defined num_classes
num_classes = 2  # 1 class (person) + background
# get the number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
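As a quick sanity check (not in the tutorial), the replaced head should now map the input features to the two classes:

print(model.roi_heads.box_predictor.cls_score)
# expected output: Linear(in_features=1024, out_features=2, bias=True)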

2. Modifying the model to add a different backbone

The other case is when you want to replace the model backbone with another backbone. For example, the current default backbone (ResNet-50) may be too large in some situations and you may want to take advantage of a smaller model. The following describes how to use torchvision to change the backbone.

import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# load a pre-trained model for classification and return only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of output channels in the backbone.
# For mobilenet_v2 it is 1280, so we set it here.
backbone.out_channels = 1280

# let's make the RPN generate 5 x 3 anchors per spatial location,
# with 5 different sizes and 3 different aspect ratios.
# We use a Tuple[Tuple[int]] because each feature map could
# potentially have different sizes and aspect ratios.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# let's define the feature maps that we will use to perform the
# region of interest cropping, as well as the size of the crop after rescaling.
# If the backbone returns a Tensor, featmap_names is expected to be [0].
# More generally, the backbone should return an OrderedDict[Tensor],
# and featmap_names selects which feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)

#Put pieces together in a Faster RCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
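A small smoke test, not part of the tutorial, can confirm that the pieces fit together. In evaluation mode the model takes a list of image tensors and returns one prediction dictionary per image:

import torch

model.eval()
dummy = [torch.rand(3, 300, 400)]   # one random 3-channel image
with torch.no_grad():
    out = model(dummy)
print(out[0].keys())                # dict_keys(['boxes', 'labels', 'scores'])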

An instance segmentation model for the PennFudan Dataset

In our case the dataset is very small, so we will fine-tune the pre-trained model, i.e. follow approach 1 above. We use Mask R-CNN because we also want to compute a segmentation mask for each instance (to determine the area of each person at the pixel level).

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

      
def get_instance_segmentation_model(num_classes):
    #Load a COCO pre-trained instance segmentation model
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)

    return model

You are now ready to train and evaluate your model on this dataset.

(Comparing this model with torchvision.models.detection.maskrcnn_resnet50_fpn, you can see that the dimensions of the following parts have changed.)

  (roi_heads): RoIHeads(
    ...
    (box_predictor): FastRCNNPredictor(
      (cls_score): Linear(in_features=1024, out_features=2, bias=True)
      (bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
    )
    ...
    (mask_predictor): MaskRCNNPredictor(
      ...
      (mask_fcn_logits): Conv2d(256, 2, kernel_size=(1, 1), stride=(1, 1))
    )
  )
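(For reference, an excerpt like the one above can be obtained by printing the heads of the customized model, for example:)

tmp_model = get_instance_segmentation_model(num_classes=2)
print(tmp_model.roi_heads.box_predictor)
print(tmp_model.roi_heads.mask_predictor)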

Training and evaluation functions

The references/detection/ folder of the torchvision repository contains a number of helper functions that simplify training and evaluating object detection models. Here we use references/detection/engine.py, references/detection/utils.py and references/detection/transforms.py.

Copy these files (and the files they depend on) so that they can be imported.

%%shell

# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0

cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../

Let's use the copied references/detection helpers to define a small function for data augmentation / transformation.

from engine import train_one_epoch, evaluate
import utils
import transforms as T


def get_transform(train):
    transforms = []
    #Convert image to Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the image and the ground truth horizontally (a mirror image)
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

The code above prepares the data: images are converted to Tensors and, for the training data, flipped horizontally at random. No normalization or rescaling of the images is needed, because the Mask R-CNN model handles that internally.
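As an example (not in the tutorial), the transform can be applied directly to one raw sample; note that it transforms the image and the target together:

raw_dataset = PennFudanDataset('PennFudanPed/')
img, target = raw_dataset[0]
img_t, target_t = get_transform(train=True)(img, target)
print(img_t.shape)            # torch.Size([3, 536, 559]) -- now a Tensor
print(target_t['boxes'][0])   # box coordinates, flipped horizontally with probability 0.5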

Putting everything together

The dataset, model, and data preparation are now ready. Let's instantiate them.

# use our dataset with the defined transforms
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset into training and test sets
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

#Define a training and validation data loader
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)
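Not in the tutorial, but it can be instructive to look at one batch. utils.collate_fn returns a tuple of images and a tuple of targets instead of stacked tensors, because the images can have different sizes:

images, targets = next(iter(data_loader))
print(len(images), images[0].shape)   # 2 images per batch, each of shape [3, H, W]
print(targets[0].keys())              # boxes, labels, masks, image_id, area, iscrowd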

Instantiate the model.

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has only two classes: background and person
num_classes = 2

#Get the model using the helper function
model = get_instance_segmentation_model(num_classes)
#Move the model to the appropriate device
model.to(device)

#Build the optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

#Learning rate scheduler that reduces the learning rate to 1/10 every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)

Train for 10 epochs, evaluating with the evaluate function after each epoch. (Training takes about 8 minutes in Colaboratory's GPU runtime; without a GPU, a runtime error occurs.)

#Training with 10 epochs
num_epochs = 10

for epoch in range(num_epochs):
    print(epoch)
    #1 Epoch training
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    #Update learning rate
    lr_scheduler.step()
    #Evaluate with test dataset
    evaluate(model, data_loader_test, device=device)

out


...
Averaged stats: model_time: 0.1179 (0.1174)  evaluator_time: 0.0033 (0.0051)
Accumulating evaluation results...
DONE (t=0.01s).
Accumulating evaluation results...
DONE (t=0.01s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.831
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.990
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.955
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.543
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.841
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.386
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.881
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.881
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.787
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.887
IoU metric: segm
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.760
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.990
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.921
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.492
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.771
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.345
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.808
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.808
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.725
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.814
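Although it is not covered in the tutorial, it is convenient to save the fine-tuned weights at this point so that training does not have to be repeated (the file name here is arbitrary):

torch.save(model.state_dict(), 'maskrcnn_pennfudan.pth')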

Now that training is finished, let's see what the model predicts on the test dataset.

#Select one image from the test set
img, _ = dataset_test[4]
#Put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

Printing the prediction shows a list of dictionaries, one per input image. Since we passed in a single test image, the list below has one element. Each dictionary holds the predictions for its image; here you can see it contains boxes, labels, masks, and scores.

prediction

out


[{'boxes': tensor([[173.1167,  27.6446, 240.8375, 313.0114],
          [325.5737,  64.3967, 453.1539, 352.3020],
          [222.4494,  24.5255, 306.5306, 291.5595],
          [296.8205,  21.3736, 379.0592, 263.7513],
          [137.4137,  38.1588, 216.4886, 276.1431],
          [167.8121,  19.9211, 332.5648, 314.0146]], device='cuda:0'),
  'labels': tensor([1, 1, 1, 1, 1, 1], device='cuda:0'),
  'masks': tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]]],
  
  
          [[[0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]]],
  
  
          [[[0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]]],
  
  
          [[[0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]]],
  
  
          [[[0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]]],
  
  
          [[[0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            ...,
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0'),
  'scores': tensor([0.9965, 0.9964, 0.9942, 0.9696, 0.3053, 0.1552], device='cuda:0')}]
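Not in the tutorial, but in practice the detections are usually filtered by score. In the run above, keeping only detections with a score of 0.5 or higher leaves the four confident pedestrians:

keep = prediction[0]['scores'] >= 0.5
print(keep.sum().item())                  # 4
boxes_kept = prediction[0]['boxes'][keep]
masks_kept = prediction[0]['masks'][keep]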

Let's check the image and the prediction result. The image (img) is a Tensor of shape [color, height, width] with values between 0 and 1, so we scale it to 0-255 and permute it to [height, width, color].

Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())

(Image: the selected test image)

Next, let's visualize the predicted masks. The masks are predicted with shape [N, 1, H, W], where N is the number of predicted instances (people). Each mask value is the probability, between 0 and 1, that the pixel belongs to a person.

Image.fromarray(prediction[0]['masks'][0, 0].mul(255).byte().cpu().numpy())

(Image: predicted mask for instance 0)
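Since each mask value is a probability, the mask can also be binarized at a threshold (0.5 is a common choice) before displaying it. A minimal sketch, not part of the tutorial:

binary = (prediction[0]['masks'][0, 0] > 0.5).byte().mul(255).cpu().numpy()
Image.fromarray(binary)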

(The other predicted instances (people) can also be visualized by changing the instance index, as shown below.)

Image.fromarray(prediction[0]['masks'][1, 0].mul(255).byte().cpu().numpy())

(Image: predicted mask for instance 1)

Image.fromarray(prediction[0]['masks'][2, 0].mul(255).byte().cpu().numpy())

(Image: predicted mask for instance 2)

Image.fromarray(prediction[0]['masks'][3, 0].mul(255).byte().cpu().numpy())

(Image: predicted mask for instance 3)

The pedestrians are predicted quite well.

Wrapping up

In this tutorial, you learned how to train an object detection model on a dataset you defined yourself. We created a torch.utils.data.Dataset class that returns the boxes and masks needed for object detection, and we used a Mask R-CNN model pre-trained on COCO train2017 to perform transfer learning on this new dataset.

For a more complete example that includes multi-machine / multi-GPU training, check references/detection/train.py in the torchvision GitHub repo.

At the end

In this tutorial, we learned about "transfer learning" and "fine tuning" using a pre-trained model. (Strictly speaking, what we did here is called fine tuning; the difference between transfer learning and fine tuning will be covered next time.) The tutorial used 120 training images and 50 validation images, but even with about 40 training images the model was able to make fairly accurate predictions. It is impressive that transfer learning can work with such a small amount of training data. Next time, I would like to proceed to the "Transfer Learning for Computer Vision Tutorial".

History

2020/11/15 First edition released
