[PYTHON] Calculation of odometry using CNN and depth estimation Part 2 (CNN SLAM # 2)

This is a continuation of "Calculation of odometry using CNN and depth estimation, Part 1". Last time I summarized the content of the paper to the best of my understanding; this time, in Part 2, I'll walk through the odometry calculation, which I implemented in Python.

Calculation of odometry Revisited

r({\bf u}, {\bf T}) = I_{k_i}({\bf u}) - I_t(\pi({\bf K}{\bf T}_{t}^{k_i} V_{k_i}({\bf u})))

We want to find the $ {\bf T} $ (rotation component + translation component) that minimizes $ r $ in the equation above. Splitting $ {\bf T} $ into a rotation matrix $ {\bf R} $ and a translation vector $ {\bf t} $, the residual can be written as follows.

r({\bf u}, {\bf T}) = I_{k_i}({\bf u}) - I_t(\pi({\bf K}({\bf R}_{t}^{k_i} V_{k_i}({\bf u}) + {\bf t}_{t}^{k_i})))

Since the rotation has three parameters, we create a function that takes np.array([alpha, beta, gamma]) as its argument and returns the corresponding rotation matrix. We use the numpy module.

util.py


import numpy as np


def make_R(rads):
    # rads = [alpha, beta, gamma]: rotation angles in radians
    if len(rads) != 3:
        raise ValueError("len(rads) != 3")
    alpha, beta, gamma = rads
    # rotation about the y-axis
    R_alpha = np.array([[np.cos(alpha), 0.0, np.sin(alpha)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(alpha), 0.0, np.cos(alpha)]])
    # rotation about the x-axis
    R_beta = np.array([[1.0, 0.0, 0.0],
                       [0.0, np.cos(beta), -np.sin(beta)],
                       [0.0, np.sin(beta), np.cos(beta)]])
    # rotation about the z-axis
    R_gamma = np.array([[np.cos(gamma), -np.sin(gamma), 0.0],
                        [np.sin(gamma), np.cos(gamma), 0.0],
                        [0.0, 0.0, 1.0]])
    # compose the three elemental rotations
    R = np.dot(R_gamma, np.dot(R_beta, R_alpha))
    return R
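
As a quick sanity check (this snippet is mine, not from the original post), make_R should return the identity for zero angles, and any output should be an orthonormal matrix with determinant 1:

import numpy as np
from util import make_R

R = make_R(np.array([0.0, 0.0, 0.0]))
print(np.allclose(R, np.eye(3)))                # True: zero angles give the identity
R = make_R(np.array([0.1, -0.2, 0.3]))
print(np.allclose(np.dot(R.T, R), np.eye(3)))   # True: columns are orthonormal
print(np.isclose(np.linalg.det(R), 1.0))        # True: a proper rotation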

$ K $ is the camera's intrinsic parameter matrix. It is expressed as follows using the focal lengths of the lens ($ f_x $, $ f_y $) and the optical center of the image ($ c_x $, $ c_y $).

K = \left(
  \begin{array}{ccc}
    f_x & 0 & c_x \\
    0 & f_y & c_y \\
    0 & 0 & 1
  \end{array}
\right)

util.py


fx = 525.0  # focal length x
fy = 525.0  # focal length y
cx = 319.5  # optical center x
cy = 239.5  # optical center y

K = np.array([[fx, 0, cx],
              [0, fy, cy],
              [0, 0, 1]], dtype=np.float64)

You can obtain these values by calibrating your camera; in my case I experimented with a downloaded dataset, so I used the intrinsics provided with it.
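
For reference, the projection $ \pi(\cdot) $ that appears in the residual simply divides a 3D point by its depth and applies $ K $. A minimal sketch (the helper name project is mine, not part of the original util.py):

import numpy as np

def project(K, P):
    # P = (X, Y, Z): a 3D point in the camera coordinate frame
    p = np.dot(K, P)
    return p[:2] / p[2]  # pixel coordinates (u, v) after perspective division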

$ V $ is expressed by the following formula.

V_{k_i}({\bf u}) = {\bf K}^{-1}\dot{\bf u}D_{k_i}({\bf u})

The paper does not spell out $ \dot{\bf u} $, but it is the pixel $ {\bf u} $ written in homogeneous coordinates, i.e. $ (u_x, u_y, 1)^T $ (that is how the code below constructs it). What we want to do is use the depth image $ D $ to compute, for each pixel, where the point shown in that pixel lies in the 3D coordinate system, that is, its position in the real world.
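
To make the formula concrete, here is a minimal sketch of back-projecting a single pixel into 3D (my own illustration; the vectorized version appears in the translate function below):

import numpy as np

def back_project(K, u, v, depth):
    # homogeneous pixel coordinate (u, v, 1)
    u_dot = np.array([u, v, 1.0])
    # V(u) = K^{-1} * u_dot * D(u): the 3D point observed at pixel (u, v)
    return depth * np.dot(np.linalg.inv(K), u_dot)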

$ I $ represents the RGB image. Here the image is a two-dimensional array (because we work with a grayscale version this time), and its indices must be integers when reading values, but after the coordinate transformation the projected points will usually have fractional coordinates. To avoid errors, we put in a little effort: for example, to read the value at index 5.3, we blend the values at index 5 and index 6 appropriately. This time we use bilinear interpolation.

util.py


def get_pixel(img, x, y):
    # bilinear interpolation: blend the four pixels surrounding (x, y)
    rx = x - np.floor(x)
    ry = y - np.floor(y)
    left_up_x = int(np.floor(x))
    left_up_y = int(np.floor(y))
    val = (1.0 - rx) * (1.0 - ry) * img[left_up_y, left_up_x] + \
          rx * (1.0 - ry) * img[left_up_y, left_up_x + 1] + \
          (1.0 - rx) * ry * img[left_up_y + 1, left_up_x] + \
          rx * ry * img[left_up_y + 1, left_up_x + 1]
    return val
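
For example (a small check of my own, not in the original post), on a tiny 2x2 image the interpolated value is a weighted mix of the four surrounding pixels:

import numpy as np
from util import get_pixel

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
print(get_pixel(img, 0.5, 0.5))  # 1.5: the average of the four corner values
print(get_pixel(img, 0.0, 0.0))  # 0.0: an integer coordinate returns the pixel itself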

The function to find $ r $ is as follows.

util.py


def translate(rads, t, im_xs, im_ys, dep_vals):
    if not (len(im_xs) == len(im_ys) == len(dep_vals)):
        print("len(im_xs) == len(im_ys) == len(dep_vals) is False!!")
        raise ValueError

    n_data = len(im_xs)
    # homogeneous pixel coordinates, one column per point
    U = np.vstack((im_xs, im_ys, [1.0] * n_data))

    R = make_R(rads)
    invK = np.linalg.inv(K)

    # back-project to 3D with the depth, rotate, translate, then re-project with K
    invK_u = np.dot(invK, U)
    R_invK_u = np.dot(R, invK_u)
    s_R_invK_u_t = dep_vals * R_invK_u + t
    K_s_R_invK_u_t = np.dot(K, s_R_invK_u_t)
    # perspective division: normalize so that the third row becomes 1
    translated_u = K_s_R_invK_u_t / K_s_R_invK_u_t[2, :]

    return translated_u


def r(rads, t, im_xs, im_ys, dep_vals, I1, I2):
    if not (len(im_xs) == len(im_ys) == len(dep_vals)):
        print("len(im_xs) == len(im_ys) == len(dep_vals) is False!!")
        raise ValueError

    n_data = len(im_xs)
    transed_u = translate(rads=rads, t=t, im_xs=im_xs, im_ys=im_ys, dep_vals=dep_vals)

    diff_arr = np.empty((n_data, 1))
    for i in range(n_data):
        im_x1, im_y1 = im_xs[i], im_ys[i]
        im_x2, im_y2 = transed_u[0, i], transed_u[1, i]
        val1 = get_pixel(img=I1, x=im_x1, y=im_y1)  # I[im_y1, im_x1]
        val2 = get_pixel(img=I2, x=im_x2, y=im_y2)  # I[im_y2, im_x2]
        diff_arr[i, 0] = val1 - val2

    return diff_arr

This function computes the residual of warping image I1 toward image I2; rads holds the rotation parameters and t the translation parameters. The approach this time is to give the parameters a reasonable initial value and then correct them little by little toward the optimum, in the same spirit as steepest descent, except that here we use the Gauss-Newton method. For that we need the Jacobian $ J $.

J = \left(
  \begin{array}{ccc}
    \frac{\partial {\bf r}_1}{\partial {\bf x}_1} & \ldots & \frac{\partial {\bf r}_1}{\partial {\bf x}_n} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial {\bf r}_m}{\partial {\bf x}_1} & \ldots & \frac{\partial {\bf r}_m}{\partial {\bf x}_n}
  \end{array}
\right)

Here the parameters $ x_1 $ to $ x_6 $ correspond to $ \alpha, \beta, \gamma, x, y, z $, and the number of residuals $ r $ corresponds to the number of pixels used. With a 640x480 image that would be 640x480 = 307200 rows, and such a large matrix slows the computation down. Besides, textureless regions such as white walls have almost no gradient, so the corresponding Jacobian entries are close to 0 and contribute little to the optimization; it is more efficient to restrict the computation to pixels with a large gradient when building $ J $. So we extract a list of image coordinates with a strong gradient and estimate the rotation and translation from that point set only. This is why the r function in util.py takes the arguments im_xs, im_ys, dep_vals (the selection code appears in main.py below). The paper also seems to concentrate the computation on high-gradient regions. If you compute over all pixels, the PC fans start working hard and it gets a little worrying.
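
For reference, the Gauss-Newton update that the code below performs is, with $ {\bf x} = (\alpha, \beta, \gamma, x, y, z)^T $ and $ {\bf r} $ the vector of residuals,

{\bf x} \leftarrow {\bf x} - (J^{T}J)^{-1}J^{T}{\bf r}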

Let's build the ingredients of the Jacobian by numerically differentiating the residual r.

util.py


rad_eps = np.pi / 900.0  # finite-difference step for the rotation angles
t_eps = 1e-04  # finite-difference step for the translation components

def grad_r(rads, t, im_xs, im_ys, dep_vals, I_transed, index=-1):
    # central-difference derivative of r with respect to parameter `index`
    # (0, 1, 2: rotation angles / 3, 4, 5: translation components)
    if not (len(im_xs) == len(im_ys) == len(dep_vals)):
        print("len(im_xs) == len(im_ys) == len(dep_vals) is False!!")
        raise ValueError

    I = I_transed

    if index < 0 or 5 < index:
        print("index is out of range or not defined")
        raise ValueError

    n_data = len(im_xs)

    if index < 3:
        rads_p, rads_m = rads.copy(), rads.copy()
        rads_p[index] += rad_eps
        rads_m[index] -= rad_eps
        u_p = translate(rads=rads_p, t=t, im_xs=im_xs, im_ys=im_ys, dep_vals=dep_vals)
        u_m = translate(rads=rads_m, t=t, im_xs=im_xs, im_ys=im_ys, dep_vals=dep_vals)
    else:
        t_p, t_m = t.copy(), t.copy()
        t_p[index - 3, 0] += t_eps
        t_m[index - 3, 0] -= t_eps
        u_p = translate(rads=rads, t=t_p, im_xs=im_xs, im_ys=im_ys, dep_vals=dep_vals)
        u_m = translate(rads=rads, t=t_m, im_xs=im_xs, im_ys=im_ys, dep_vals=dep_vals)

    grad = np.empty(n_data, )
    for i in range(n_data):
        im_x_p, im_y_p = u_p[0, i], u_p[1, i]
        im_x_m, im_y_m = u_m[0, i], u_m[1, i]
        val_p = get_pixel(img=I, x=im_x_p, y=im_y_p)  # I[im_y_p, im_x_p]
        val_m = get_pixel(img=I, x=im_x_m, y=im_y_m)  # I[im_y_m, im_x_m]
        grad[i] = -(val_p - val_m)  # r = I1 - I2(...), hence the minus sign

    if index < 3:
        grad /= (2.0 * rad_eps)
    else:
        grad /= (2.0 * t_eps)

    return grad

The code is fairly rough, since it wasn't written with publication in mind, but this is what it looks like. The variable I_transed corresponds to I2 in the r function, i.e. the destination image. The variable index selects which of the variables $ x_1 $ to $ x_6 $ to differentiate with respect to. Calling the function six times, once per parameter, yields the Jacobian.

main.py


from util import *

J_T = None  # Jacobian transpose matrix
for ind in range(6):
    grad = grad_r(rads=rads, t=t, im_xs=xs, im_ys=ys, dep_vals=dep_vals, I_transed=img2, index=ind)
    if J_T is None:
        J_T = np.copy(grad)
    else:
        J_T = np.vstack((J_T, grad))

Strictly speaking, what we build here is the transpose of the Jacobian, but that is no problem. To update the parameters, we just let numpy do the matrix operations; numpy really does all the heavy lifting.

main.py


JJ = np.dot(J_T, J_T.T)
invJJ = np.linalg.inv(JJ)
invJJ_J = np.dot(invJJ, J_T)
invJJ_J_r = np.dot(invJJ_J, diff_arr)

rads -= invJJ_J_r[:3, 0]
t -= invJJ_J_r[3:, :]

I didn't mention it in the explanation above, but to find the parts of the image with a large gradient, I apply a Laplacian filter using the cv2 module. It's faster to just look at the code, so I'll post the whole thing.

main.py


from PIL import Image
import numpy as np
import cv2
from util import *

WIDTH = 640
HEIGHT = 480

data_dir = "XXX/living_room_traj2_frei_png/"

# estimate the transformation from frame 1 to frame 2
image_file_1 = data_dir + "rgb/10.png"
depth_file_1 = data_dir + "depth/10.png"
image_file_2 = data_dir + "rgb/30.png"

# get image
img1 = Image.open(image_file_1)
img1 = img1.resize([WIDTH, HEIGHT], Image.ANTIALIAS)
raw_img1 = cv2.cvtColor(np.array(img1), cv2.COLOR_RGB2GRAY)  # PIL loads images in RGB order
img1 = raw_img1.astype('float64') / 255.0

dep1 = Image.open(depth_file_1)
dep1 = dep1.resize([WIDTH, HEIGHT], Image.ANTIALIAS)
dep1 = np.array(dep1, 'float64') / 5000.0

img2 = Image.open(image_file_2)
img2 = img2.resize([WIDTH, HEIGHT], Image.ANTIALIAS)
raw_img2 = cv2.cvtColor(np.array(img2), cv2.COLOR_RGB2GRAY)  # PIL loads images in RGB order
img2 = raw_img2.astype('float64') / 255.0

# Get the high-gradient coordinates (and their depth values) in the source image
lap1 = np.abs(cv2.Laplacian(raw_img1, cv2.CV_32F, ksize=5))
th = sorted(lap1.flatten())[::-1][2000]  # keep roughly the 2000 strongest responses

xs, ys, dep_vals = list(), list(), list()
bias = 30  # ignore pixels near the image border
for y in range(bias, HEIGHT - bias):
    for x in range(bias, WIDTH - bias):
        if lap1[y, x] > th:
            xs.append(x)
            ys.append(y)
            dep_vals.append(dep1[y, x])

xs = np.array(xs, dtype=np.float64)
ys = np.array(ys, dtype=np.float64)
dep_vals = np.array(dep_vals, dtype=np.float64)

# Find the transformation based on the Gauss-Newton algorithm
# initial values for the parameters
rads = np.array([0.0, 0.0, 0.0]).reshape(3, )
t = np.array([0.0, 0.0, 0.0]).reshape(3, 1)

# loop 10 times for now
for _ in range(10):
    diff_arr = r(rads=rads, t=t, im_xs=xs, im_ys=ys, dep_vals=dep_vals, I1=img1, I2=img2)

    J_T = None  # Jacobian transpose matrix
    for ind in range(6):
        grad = grad_r(rads=rads, t=t, im_xs=xs, im_ys=ys, dep_vals=dep_vals, I_transed=img2, index=ind)
        if J_T is None:
            J_T = np.copy(grad)
        else:
            J_T = np.vstack((J_T, grad))

    JJ = np.dot(J_T, J_T.T)
    invJJ = np.linalg.inv(JJ)
    invJJ_J = np.dot(invJJ, J_T)
    invJJ_J_r = np.dot(invJJ_J, diff_arr)

    rads -= invJJ_J_r[:3, 0]
    t -= invJJ_J_r[3:, :]

    print(rads.reshape(-1))
    print(t.reshape(-1))
    print("-----")

# Using the estimated rotation and translation, find where each pixel of img1 moves to
# and reconstruct img1 by sampling img2 at the destination
out = transform_image(rads=rads, t=t, dep_img=dep1, I=img2)
out = (255.0 * out).astype(np.uint8)

cv2.imshow('output', cv2.hconcat((raw_img1, raw_img2, out)))
cv2.waitKey(0)

util.py


WIDTH = 640   # image size (set to match main.py)
HEIGHT = 480


def transform_image(rads, t, dep_img, I):
    # for every pixel of the source image, find where it lands in the
    # destination image I and sample I there with bilinear interpolation
    im_xs, im_ys = np.meshgrid(np.arange(WIDTH), np.arange(HEIGHT))
    im_xs = im_xs.reshape(-1)
    im_ys = im_ys.reshape(-1)
    dep_vals = dep_img.reshape(-1)
    transed_u = translate(rads=rads, t=t, im_xs=im_xs, im_ys=im_ys, dep_vals=dep_vals)

    out = np.zeros((HEIGHT, WIDTH))
    for i in range(len(im_xs)):
        im_x1, im_y1 = im_xs[i], im_ys[i]
        im_x2, im_y2 = transed_u[0, i], transed_u[1, i]
        try:
            out[im_y1, im_x1] = get_pixel(img=I, x=im_x2, y=im_y2)
        except IndexError:
            # the projected point fell outside the destination image
            out[im_y1, im_x1] = 0.0

    return out

Test

・Conversion source image (img1): 10.png

・Conversion destination image (img2): 30.png

・Conversion source image restored from the estimated parameters (grayscale): output.png

The tilt of the sofa appears to be corrected. It is not perfect, but it seems the rotation and translation components can be computed reasonably well. This time I used the ICL-NUIM RGB-D dataset. I haven't tried it on other datasets, so please check for yourself whether this implementation really holds up.
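
As one extra sanity check (not in the original code), you could append something like the following to main.py and confirm that the mean absolute photometric residual shrinks after optimization; a minimal sketch reusing the r function:

# hypothetical check: the mean absolute residual should decrease after optimization
init_diff = r(rads=np.zeros(3), t=np.zeros((3, 1)), im_xs=xs, im_ys=ys,
              dep_vals=dep_vals, I1=img1, I2=img2)
final_diff = r(rads=rads, t=t, im_xs=xs, im_ys=ys,
               dep_vals=dep_vals, I1=img1, I2=img2)
print("mean |r| before:", np.abs(init_diff).mean())
print("mean |r| after :", np.abs(final_diff).mean())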

In closing

This time I didn't actually estimate depth, so the title is a bit misleading. In the paper, the depth is estimated as well; I'd like to tackle that soon.
