[PYTHON] Affine transformation by OpenCV (CUDA)

Motivation

See Past Articles

Summary

- Performed an affine transformation with OpenCV in Python.
- Conclusions:
  - CUDA is great.
  - OpenCV's multithreading is impressive.

environment

code

Preprocessing

First, import the libraries and load maharo.jpg, a 200 × 200 image of Mahalo, an otter at the Sunshine Aquarium (https://wikiwiki.jp/kawausotter/%E3%83%9E%E3%83%8F%E3%83%AD%EF%BC%88%E3%82%B5%E3%83%B3%E3%82%B7%E3%83%A3%E3%82%A4%E3%83%B3%E6%B0%B4%E6%97%8F%E9%A4%A8%EF%BC%89). See Past Articles for details about this image.

import cv2 as cv
import numpy as np
from matplotlib import pyplot as plt
print('Enabled CUDA devices:',cv.cuda.getCudaEnabledDeviceCount()) # 1

src = cv.cvtColor(cv.imread("maharo.jpg"), cv.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to RGB for matplotlib
h, w, c = src.shape #200, 200, 3
plt.subplot(111),plt.imshow(src)
plt.title('otter image'), plt.xticks([]), plt.yticks([])
plt.show()

g_src = cv.cuda_GpuMat()
g_dst = cv.cuda_GpuMat()
g_src.upload(src)

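Next, define the affine transformation matrix. It combines a rotation about the image center with a translation, so that the image is rotated by rot degrees and scaled by scale about its center, and the center of the source lands on the center of the enlarged output.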

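def get_rot_affine(src, rot, scale):
    # shape is (height, width[, channels]); take only the first two entries
    h, w = src.shape[:2]
    # getRotationMatrix2D expects the center as (x, y) = (w/2, h/2)
    rot_affine = cv.getRotationMatrix2D((w/2, h/2), rot, scale)
    # Move the rotation center to the center of the scaled output canvas
    rot_affine[:2, 2] -= [w/2, h/2]
    rot_affine[:2, 2] += [w/2*scale, h/2*scale]
    return rot_affine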
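To see what the translation adjustment in get_rot_affine does, the following is a minimal pure-Python sketch that needs no OpenCV. It assumes the 2x3 matrix layout given in the OpenCV documentation for getRotationMatrix2D; the helper names rot_affine_matrix and apply_affine are hypothetical and exist only for this illustration. It checks that the center of the 200 × 200 source is mapped to the center of the 10x output.

```python
import math

def rot_affine_matrix(w, h, angle_deg, scale):
    """Hypothetical stand-in for get_rot_affine that uses only the math module."""
    cx, cy = w / 2, h / 2
    a = scale * math.cos(math.radians(angle_deg))
    b = scale * math.sin(math.radians(angle_deg))
    # Assumed layout of cv.getRotationMatrix2D((cx, cy), angle_deg, scale)
    m = [[a, b, (1 - a) * cx - b * cy],
         [-b, a, b * cx + (1 - a) * cy]]
    # Shift so the source center lands on the center of the scaled canvas
    m[0][2] += cx * scale - cx
    m[1][2] += cy * scale - cy
    return m

def apply_affine(m, x, y):
    """Apply a 2x3 affine matrix to the point (x, y)."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

m = rot_affine_matrix(200, 200, 20, 10)
cx_out, cy_out = apply_affine(m, 100, 100)
print(round(cx_out, 6), round(cy_out, 6))  # 1000.0 1000.0
```

The source center (100, 100) ends up at (1000, 1000), the middle of the 2000 × 2000 output, which is why the output size is passed as (w*10, h*10) later on.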

1x scale, 20-degree rotation

For a 20-degree rotation at 1x scale, the affine matrix is obtained as follows. Note that the CUDA version does not support Lanczos interpolation.

rot_affine = get_rot_affine(src, 20, 1)

CPU

First, try it on the CPU.

%%timeit
img_dst = cv.warpAffine(src, rot_affine, (w*1, h*1), flags=cv.INTER_CUBIC)
# 1.08 ms ± 7.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
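%%timeit is a Jupyter cell magic, so it is not available when the same measurement is run as a plain script. A rough equivalent can be sketched with the standard timeit module. The helper time_call and the stand-in workload are hypothetical; in this article's setting the callable would wrap the cv.warpAffine call instead.

```python
import timeit

def time_call(fn, repeat=7, number=1000):
    """Hypothetical helper: per-call seconds for each of `repeat` runs of `number` calls."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return [t / number for t in totals]

# Stand-in workload so the sketch runs without OpenCV. With OpenCV loaded it would be:
#   lambda: cv.warpAffine(src, rot_affine, (w, h), flags=cv.INTER_CUBIC)
def workload():
    return sum(i * i for i in range(100))

per_call = time_call(workload, repeat=7, number=1000)
mean = sum(per_call) / len(per_call)
print(f"{mean * 1e6:.1f} µs per loop (mean of {len(per_call)} runs, 1000 loops each)")
```

The numbers will not match %%timeit exactly, since the magic picks the loop count automatically, but the mean over 7 runs is comparable.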


CUDA

First, time only the GPU processing, with no host-to-device or device-to-host transfers in the loop. As usual, the function is cv.cuda.warpAffine. Only the source image src and the output image dst live on the GPU; the affine matrix and the dst size appear to be passed the same way as for the ordinary cv.warpAffine. (Putting them on the GPU raises an error.)

%%timeit
g_dst = cv.cuda.warpAffine(g_src, rot_affine, (w*1, h*1), flags=cv.INTER_CUBIC)
# 383 µs ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The result is modest: with an image this small, the GPU is only about 2 to 3 times faster. Next, include the upload to and the download from the GPU in the timed loop.

%%timeit
g_src.upload(src)
g_dst = cv.cuda.warpAffine(g_src, rot_affine, (w*1, h*1), flags=cv.INTER_CUBIC)
gpu_dst = g_dst.download()
# 450 µs ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Even with the upload and download included, the time barely changes, because the image is small.

10x scale, 20-degree rotation

Enlarge Mahalo 10 times (to 2000 × 2000, roughly 2K) and rotate it 20 degrees. The affine matrix is obtained as follows.

rot_affine = get_rot_affine(src, 20, 10)

CPU

First, try it on the CPU.

%%timeit
img_dst = cv.warpAffine(src, rot_affine, (w*10, h*10), flags=cv.INTER_CUBIC)
# 7.29 ms ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is quite fast for a 10x enlargement. This depends on the environment: with the 200 px image, parallelization did not kick in and the CPU ran on a single thread, but with the 2K image all 32 threads were used, which accounts for the speed. (It would be slower on an ordinary Core i5 / i7. Incidentally, the 2 × Xeon E5-2667 v3 @ 3.20 GHz in the test environment scores about the same as a single Ryzen 9 3900XT in Cinebench. Times have changed ...)


CUDA

%%timeit
g_dst = cv.cuda.warpAffine(g_src, rot_affine, (w*10, h*10), flags=cv.INTER_CUBIC)
# 1.42 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This is about 5 times faster than the CPU. The speedup would be more clear-cut if the opponent were not a dual Xeon; against it, the GTX 1080 seems to be at a slight disadvantage. Next, with the upload and download included:

%%timeit
g_src.upload(src)
g_dst = cv.cuda.warpAffine(g_src, rot_affine, (w*10, h*10), flags=cv.INTER_CUBIC)
gpu_dst = g_dst.download()
# 2.64 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With the upload and download included, it takes somewhat longer because the output image is large.
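The timings reported above can be put side by side with a little arithmetic. The sketch below only uses the figures from this article (converted to microseconds); the dictionary layout and the names timings_us and summary are assumptions made for the illustration.

```python
# Timings reported in this article, in microseconds
timings_us = {
    "1x":  {"cpu": 1080.0, "gpu": 383.0, "gpu_with_transfer": 450.0},
    "10x": {"cpu": 7290.0, "gpu": 1420.0, "gpu_with_transfer": 2640.0},
}

summary = {}
for label, t in timings_us.items():
    summary[label] = {
        # CPU time divided by GPU time without upload/download
        "gpu_only": t["cpu"] / t["gpu"],
        # CPU time divided by GPU time including upload/download
        "end_to_end": t["cpu"] / t["gpu_with_transfer"],
        # Extra time spent on upload and download
        "transfer_us": t["gpu_with_transfer"] - t["gpu"],
    }
    s = summary[label]
    print(f"{label}: GPU only {s['gpu_only']:.1f}x, "
          f"end to end {s['end_to_end']:.1f}x, transfer {s['transfer_us']:.0f} µs")
```

At 10x the GPU-only speedup is about 5.1x, but the transfers add roughly 1.2 ms, so the end-to-end gain drops to about 2.8x. At 1x the transfers cost only about 67 µs.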


The code used in this article (a Jupyter Notebook) is published on GitHub.

reference

Completely understand affine transformation
