[PYTHON] PointNet theory and implementation (point cloud data)

Introduction

This article explains **PointNet**, the most basic deep learning model for **point cloud data**. Since understanding PointNet requires understanding point cloud data first, we will begin by explaining point cloud data, then cover the theory behind PointNet and implement it in **PyTorch**.

In addition, as a simple experiment with PointNet, we will perform a **binary classification task**: sample 3D point clouds from a uniform distribution and a normal distribution, and guess from which distribution each point cloud was sampled.

The PointNet paper is [here](https://arxiv.org/abs/1612.00593). The implemented code is posted on GitHub.

What is point cloud data (Point Cloud)?

Point cloud data has attracted attention in recent years, largely because of autonomous driving: the data obtained from LiDAR, a sensor used in self-driving cars, is point cloud data. It is also used in 3D surveying in the construction industry and in chemical calculations on molecules, so its range of applications is wide.

Despite this wide range of applications, point cloud data has **three important properties** (order invariance, movement invariance, and locality), and all three must be taken into account when handling point clouds in machine learning.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/302831/9e731cee-f997-6f85-b7c1-ac37a025ba5c.png", width="35%">

Order invariance

Order invariance is the property that the output of a machine learning model does not change even if the points in a point cloud are fed in a different order.

For example, in an image, each pixel can be ordered from the upper left to the lower right. Point cloud data, however, has no such ordering of its points, so each time a point cloud is fed into a machine learning model, the points may arrive in a different permutation. The model is then required to output the same value every time (to be invariant) for these differently permuted inputs. In other words, order invariance is the condition that

$$f(\boldsymbol x_1, \boldsymbol x_2, ..., \boldsymbol x_M) = f(\boldsymbol x_{\pi(1)}, \boldsymbol x_{\pi(2)}, ..., \boldsymbol x_{\pi(M)})$$

where $\boldsymbol x_m$ is the 3D coordinates of the $m$-th point and $\pi$ is an arbitrary permutation. Since this is hard to grasp from the formula alone, the figure below illustrates it.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/302831/992734b8-caa4-a89b-d74b-bd26f6e55751.png", width="65%">

As explained in detail later, PointNet satisfies order invariance by using a function called Max Pooling.

Movement invariance

Movement invariance is the property that the output does not change even if the input point cloud is translated or rotated. This property is not actually unique to point cloud data; images have it as well. Translation invariance is expressed by the following equation:

$$f(\boldsymbol x_1+\boldsymbol r, \boldsymbol x_2+\boldsymbol r, ..., \boldsymbol x_M+\boldsymbol r) = f(\boldsymbol x_1, \boldsymbol x_2, ..., \boldsymbol x_M)$$

Each input $\boldsymbol x_m$ is translated by $\boldsymbol r$, yet the output is unchanged. The equation for rotation invariance is analogous:

$$f(R\boldsymbol x_1, R\boldsymbol x_2, ..., R\boldsymbol x_M) = f(\boldsymbol x_1, \boldsymbol x_2, ..., \boldsymbol x_M)$$

Here the output is invariant with respect to a rotation $R$ applied to the input data.

As explained in detail later, PointNet approximately acquires movement invariance by applying an affine transformation (translation and rotation) to the input point cloud. However, **it does not strictly satisfy movement invariance**. To satisfy it strictly, one approach is to use the distances between pairs of points as features: even if two points are translated or rotated together, the distance between them is unchanged, so pairwise distances are strictly movement invariant. This approach is used in chemistry papers such as SchNet and [HIP-NN](https://aip.scitation.org/doi/abs/10.1063/1.5011181).
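Since pairwise distances are a simple idea, here is a quick sanity check of it in code. This is my own illustration, not from the PointNet paper: it applies an arbitrary rotation and translation to a point cloud and confirms that the pairwise distance matrix does not change.

import math

import torch

# An arbitrary rotation around the z-axis and an arbitrary translation.
theta = 0.7
R = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                  [math.sin(theta),  math.cos(theta), 0.0],
                  [0.0,              0.0,             1.0]])
r = torch.randn(3)

points = torch.randn(16, 3)   # a point cloud of 16 points
moved = points @ R.T + r      # rotate, then translate

# The pairwise distance matrix is unchanged by the movement.
print(torch.allclose(torch.cdist(points, points),
                     torch.cdist(moved, moved), atol=1e-5))  # -> True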

Locality

Locality is the property that **points that are spatially close to each other are closely related, while points that are spatially distant are only weakly related**. This property is not unique to point clouds; images and the like have it as well. In images, locality can be captured by using convolution layers. (In fact, PointNet does not satisfy locality, which is its biggest drawback. PointNet++ is the model that overcomes this drawback.)

PointNet theory

First, here is the architecture of PointNet. The blue part is the Classification Network and the yellow part is the Segmentation Network; as the names suggest, each is used depending on whether the task is classification or segmentation. This article explains only the blue Classification Network, but that is where the essence of PointNet lies.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/302831/01a21a0e-8582-98d4-e902-fb348b9482cb.png", width="100%">

The flow of the Classification Network is as follows. First, the input point cloud undergoes an affine transformation in the **input transform**, which approximately acquires movement invariance. Next, the transformed point cloud is processed by a neural network, and another affine transformation is applied in the feature transform. The result is processed by a further neural network, and finally **Max Pooling** is applied to acquire order invariance and produce the output.

Max Pooling

The most important part of PointNet is **Max Pooling**. Max Pooling is a very simple function: **it outputs the largest of its input elements**. For example, if the input elements are {0, 1, 2, 3}, the output of Max Pooling is the maximum element, 3:

$$\mathrm{MaxPooling}(0, 1, 2, 3) = 3$$

The output does not change even if the input elements are reordered before being passed through Max Pooling:

$$\mathrm{MaxPooling}(1, 0, 3, 2) = 3$$

From this it can be seen that Max Pooling satisfies order invariance. PointNet acquires **order invariance** by using Max Pooling at the end of the network.
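To see this in code (my own illustration, not from the paper), the following checks that max pooling over the point dimension returns the same feature vector no matter how the points are ordered:

import torch

features = torch.randn(16, 1024)   # per-point features for 16 points
perm = torch.randperm(16)          # an arbitrary reordering of the points

pooled = features.max(dim=0).values
pooled_shuffled = features[perm].max(dim=0).values

print(torch.equal(pooled, pooled_shuffled))  # -> True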

Input transform

The input transform (and likewise the feature transform) translates and rotates the input point cloud by multiplying it by an **affine matrix**, which is how PointNet approximately acquires **movement invariance**. The affine matrix itself is obtained as the output of **T-Net**. T-Net has a structure like a miniature PointNet, consisting of a combination of neural network layers and Max Pooling. Feeding a 3D point cloud into T-Net yields an affine matrix as its output.

Implementation (PyTorch)

Implementation of Input transform (T-Net)

As explained earlier, T-Net is a network that takes a 3D point cloud as input and outputs an affine matrix.

As shown below, it repeatedly applies non-linear transformations (NonLinear), with Max Pooling sandwiched in the middle, and finally outputs a tensor of size (9 × 1). Reshaping this output to (3 × 3) gives the affine matrix. The matrix product of this affine matrix and the input data is then computed and passed on to the next layer.

Note that the feature transform, which applies an affine transformation to the features in the middle of PointNet, is almost identical, so its explanation is omitted.

model.py


import torch
import torch.nn as nn


class InputTNet(nn.Module):
    def __init__(self, num_points):
        super(InputTNet, self).__init__()
        self.num_points = num_points
        
        self.main = nn.Sequential(
            NonLinear(3, 64),
            NonLinear(64, 128),
            NonLinear(128, 1024),
            MaxPool(1024, self.num_points),
            NonLinear(1024, 512),
            NonLinear(512, 256),
            nn.Linear(256, 9)
        )
        
    # shape of input_data is (batchsize x num_points, channel)
    def forward(self, input_data):
        matrix = self.main(input_data).view(-1, 3, 3)
        out = torch.matmul(input_data.view(-1, self.num_points, 3), matrix)
        out = out.view(-1, 3)
        return out
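The `MaxPool` used above is another self-made module that is not shown in this excerpt (the exact version is on GitHub). A minimal sketch consistent with the shapes used in this article, taking input of shape (batch_size × num_points, channels) and producing output of shape (batch_size, channels), could look like this:

class MaxPool(nn.Module):
    def __init__(self, num_channels, num_points):
        super(MaxPool, self).__init__()
        self.num_channels = num_channels
        self.num_points = num_points

    # input: (batch_size * num_points, num_channels)
    # output: (batch_size, num_channels), the max over each point cloud
    def forward(self, input_data):
        out = input_data.view(-1, self.num_points, self.num_channels)
        out = torch.max(out, dim=1).values
        return out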

By the way, NonLinear is a self-made module that bundles a dense (linear) layer, ReLU, and batch normalization.

model.py


class NonLinear(nn.Module):
    def __init__(self, input_channels, output_channels):
        super(NonLinear, self).__init__()
        self.input_channels = input_channels
        self.output_channels = output_channels

        self.main = nn.Sequential(
            nn.Linear(self.input_channels, self.output_channels),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(self.output_channels))

    def forward(self, input_data):
        return self.main(input_data)
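As a quick shape check (a hypothetical usage example, not from the original post): passing 32 points with 3D coordinates through `NonLinear(3, 64)` yields a 64-dimensional feature per point.

x = torch.randn(32, 3)            # 32 points with 3D coordinates
print(NonLinear(3, 64)(x).shape)  # -> torch.Size([32, 64])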

Implementation of the entire PointNet

PointNet has the structure input → T-Net → NN → T-Net → NN → Max Pooling → NN → output, so we simply translate this structure into code.

model.py


class PointNet(nn.Module):
    def __init__(self, num_points, num_labels):
        super(PointNet, self).__init__()
        self.num_points = num_points
        self.num_labels = num_labels
        
        self.main = nn.Sequential(
            InputTNet(self.num_points),
            NonLinear(3, 64),
            NonLinear(64, 64),
            FeatureTNet(self.num_points),
            NonLinear(64, 64),
            NonLinear(64, 128),
            NonLinear(128, 1024),
            MaxPool(1024, self.num_points),
            NonLinear(1024, 512),
            nn.Dropout(p=0.3),
            NonLinear(512, 256),
            nn.Dropout(p=0.3),
            NonLinear(256, self.num_labels),
            )
        
    def forward(self, input_data):
        return self.main(input_data)
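The `FeatureTNet` used here was omitted earlier. As noted above, it is almost identical to `InputTNet`, with the dimension changed from 3 to 64; the following is a minimal sketch under that assumption (the exact version is on GitHub):

class FeatureTNet(nn.Module):
    def __init__(self, num_points):
        super(FeatureTNet, self).__init__()
        self.num_points = num_points

        self.main = nn.Sequential(
            NonLinear(64, 64),
            NonLinear(64, 128),
            NonLinear(128, 1024),
            MaxPool(1024, self.num_points),
            NonLinear(1024, 512),
            NonLinear(512, 256),
            nn.Linear(256, 64 * 64)
        )

    # shape of input_data is (batchsize x num_points, 64)
    def forward(self, input_data):
        matrix = self.main(input_data).view(-1, 64, 64)
        out = torch.matmul(input_data.view(-1, self.num_points, 64), matrix)
        out = out.view(-1, 64)
        return out

With this structure, the final linear layer sits at index 6 of `main`, which is consistent with the `'main.3.main.6.bias'` key used in the initialization code below.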

Experiment

As a simple experiment, I randomly sampled 3D point clouds from a uniform distribution and a normal distribution, and used PointNet to predict which distribution each point cloud was sampled from.

The function that samples from the two distributions is implemented as follows.

sampler.py


import torch


def data_sampler(batch_size, num_points):
    half_batch_size = int(batch_size / 2)
    # Half the batch is sampled from a normal distribution (label 1),
    # the other half from a uniform distribution (label 0).
    normal_sampled = torch.randn(half_batch_size, num_points, 3)
    uniform_sampled = torch.rand(half_batch_size, num_points, 3)
    normal_labels = torch.ones(half_batch_size)
    uniform_labels = torch.zeros(half_batch_size)

    input_data = torch.cat((normal_sampled, uniform_sampled), dim=0)
    labels = torch.cat((normal_labels, uniform_labels), dim=0)

    # Shuffle so that the two classes are mixed within the batch.
    data_shuffle = torch.randperm(batch_size)

    # Flatten the point clouds to (batch_size * num_points, 3) for PointNet.
    return input_data[data_shuffle].view(-1, 3), labels[data_shuffle].view(-1, 1)
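For reference, the shapes this function returns (hypothetical usage):

input_data, labels = data_sampler(batch_size=64, num_points=16)
print(input_data.shape)  # -> torch.Size([1024, 3]), 64 clouds x 16 points, flattened
print(labels.shape)      # -> torch.Size([64, 1])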

Using the PointNet implemented above together with this sampling function, training and evaluation proceed as follows. The batch size is 64 and each point cloud contains 16 points.

As for `new_param`, it sets the initial value of the bias of the final layer of each T-Net to the (flattened) identity matrix, so that each T-Net initially performs the identity transformation. This initialization is recommended in the paper.

main.py


import torch
import torch.nn as nn
import torch.optim as optim

from model import PointNet
from sampler import data_sampler

batch_size = 64
num_points = 16
num_labels = 1

pointnet = PointNet(num_points, num_labels)
        
new_param = pointnet.state_dict()
new_param['main.0.main.6.bias'] = torch.eye(3, 3).view(-1)
new_param['main.3.main.6.bias'] = torch.eye(64, 64).view(-1)
pointnet.load_state_dict(new_param)

criterion = nn.BCELoss()
optimizer = optim.Adam(pointnet.parameters(), lr=0.001)

loss_list = []
accuracy_list = []

for iteration in range(100+1):
    
    pointnet.zero_grad()
    
    input_data, labels = data_sampler(batch_size, num_points)
    
    output = pointnet(input_data)
    output = nn.Sigmoid()(output)

    error = criterion(output, labels)
    error.backward()
    
    optimizer.step()
    
    if iteration % 10 == 0:
        with torch.no_grad():
            output[output > 0.5] = 1
            output[output < 0.5] = 0
            accuracy = (output==labels).sum().item()/batch_size
            
            loss_list.append(error.item())
            accuracy_list.append(accuracy)
            
        print('Iteration : {}   Loss : {}'.format(iteration, error.item()))
        print('Iteration : {}   Accuracy : {}'.format(iteration, accuracy))

The result is as follows:

I'm not sure whether the task is simply too easy or PointNet is just that good, but you can see that the point clouds are classified well. The code given here is only part of the whole; if you want to see everything, please check the GitHub repository.
