This time, we will introduce **DARK**, one of the latest papers (from 2020) on 2D pose estimation models.
In pose estimation models, the **heatmap** has become the standard coordinate representation, yet it had never been studied systematically.
The paper introduced here systematically investigates the heatmap representation and then proposes a new coordinate representation.
Heatmap example: from Stacked Hourglass Networks for Human Pose Estimation [^0]
:white_check_mark: *Achieved **SOTA**[^1] (AP: 77.4) on the COCO dataset!*
:white_check_mark: *Improved the heatmap **encoding** method!*
:white_check_mark: *Improved the heatmap **decoding** method!*
When training a classifier, it is common to use one-hot vectors as labels, but for CNN-based pose estimation models, heatmaps are the common choice.
However, the heatmap representation has the **drawback** that its computational cost grows quadratically with the input image resolution, so the image must be downsampled before being fed to the model. For the same reason, the output heatmap is also low-resolution, so the predicted coordinates must be restored to the scale of the original image.
This paper takes a close look at each of these two stages: **compression (encoding)** and **restoration (decoding)**.
First, let's look at the part that converts joint coordinates into a heatmap, starting with the conventional method.
Conventionally, the joint coordinates are downsampled and then quantized with the **floor function** (which truncates the fractional part); in the figure below, the blue dot is rounded to the purple dot. A heatmap is then created as a normal distribution centered on those quantized coordinates. As you can see, this quantization means the heatmap is no longer accurate.
The proposed method, on the other hand, **creates the normal distribution centered on the original, un-quantized coordinates**, which yields a more faithful heatmap.
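As a rough illustration, here is a minimal numpy sketch of the difference between the two encodings (this is my own sketch, not the paper's code; `gaussian_heatmap`, its parameters, and the example coordinates are all hypothetical):

```python
import numpy as np

def gaussian_heatmap(center, shape=(64, 48), sigma=2.0):
    """Render a 2D Gaussian heatmap centered at `center` = (x, y)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))

# Joint at (123.7, 86.2) in the input image, downsampled by stride 4
stride = 4
gt = np.array([123.7, 86.2]) / stride      # (30.925, 21.55) in heatmap coordinates

biased   = gaussian_heatmap(np.floor(gt))  # conventional: center snapped to (30, 21)
unbiased = gaussian_heatmap(gt)            # proposed: sub-pixel center kept as-is
```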
The actual accuracy is also clearly better (**Unbiased** is the proposed method). ↓
Next, we explain how to recover the joint position (in the original image) from the output heatmap. The encoding part was relatively simple, but decoding is a bit more involved.
First, the conventional decoding methods. The current mainstream approaches either use only the position of the heatmap's maximum value or **take a weighted average over the whole heatmap**:
\mathbf{p} = \frac{\sum_{i=1}^{N} w_i \mathbf{x}_i}{\sum_{i=1}^{N} w_i}
$\mathbf{x}_i$ is the position of heatmap pixel $i$ and $w_i$ is the heatmap value at that pixel.
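For reference, a minimal numpy sketch of this weighted-average decoding (the function name is mine, not from the paper):

```python
import numpy as np

def weighted_average_decode(heatmap):
    """Decode (x, y) as the heatmap-value-weighted mean of all pixel positions."""
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    w = heatmap.clip(min=0)          # negative responses would distort the mean
    total = w.sum() + 1e-12          # avoid division by zero on an empty map
    return np.array([(w * xs).sum(), (w * ys).sum()]) / total
```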
The proposed method, on the other hand, improves decoding in three steps, as shown in the figure below.
The output heatmap often has a jagged shape near its maximum, so before any further processing it is smoothed with a **Gaussian filter**:
\mathbf{h}' = K \circledast \mathbf{h}
$\circledast$ denotes the convolution operation and $K$ is the Gaussian kernel.
The figure below shows a heatmap before (left) and after (right) this modulation. You can see that the jaggedness is gone.
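A minimal sketch of step (a) using scipy's `gaussian_filter`; rescaling the smoothed map back to the original peak magnitude is my reading of the paper's magnitude-preserving step, so treat the details as an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def modulate(heatmap, sigma=2.0):
    """Step (a): smooth the jagged peak with a Gaussian kernel K."""
    smoothed = gaussian_filter(heatmap, sigma)
    # Rescale so the peak magnitude matches the original heatmap,
    # preserving the response scale for the later steps.
    return smoothed * (heatmap.max() / smoothed.max())
```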
Next, we locate the joint position on the heatmap corrected in (a).
First of all, as a major premise, we assume that the estimated heatmap, like the training data, follows a **two-dimensional normal distribution**:
\mathcal{G}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)|\Sigma|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)
$\mathbf{x}$ is a heatmap pixel position, $\boldsymbol{\mu}$ is the center of the normal distribution (**← the coordinates we want to find**), and $\Sigma$ is the covariance matrix.
Under this assumption, the joint coordinates $\boldsymbol{\mu}$ can be derived from the condition that the gradient is zero at the maximum of the normal distribution. Concretely, take the logarithm of the heatmap, $\mathcal{D}(\mathbf{x}) = \ln \mathcal{G}(\mathbf{x})$, Taylor-expand it to second order around the maximum pixel $\mathbf{m}$, and solve for the point where the gradient vanishes:
\boldsymbol{\mu} = \mathbf{m} - \left( \mathcal{D}''(\mathbf{m}) \right)^{-1} \mathcal{D}'(\mathbf{m})
$\mathbf{m}$ is the pixel position of the maximum value, and $\mathcal{D}'$ and $\mathcal{D}''$ are the gradient and Hessian of $\mathcal{D}$.
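A minimal numpy sketch of step (b): take the log of the modulated heatmap, estimate $\mathcal{D}'(\mathbf{m})$ and $\mathcal{D}''(\mathbf{m})$ with finite differences at the maximum pixel, and apply the formula above. All names and the boundary handling are my own choices, not the paper's code:

```python
import numpy as np

def taylor_decode(heatmap, eps=1e-10):
    """Step (b): sub-pixel peak localization via a 2nd-order Taylor expansion."""
    D = np.log(np.maximum(heatmap, eps))   # D = ln(heatmap), assumed ~ log-Gaussian
    my, mx = np.unravel_index(D.argmax(), D.shape)

    # Fall back to the integer maximum if it sits on the border
    h, w = D.shape
    if not (0 < mx < w - 1 and 0 < my < h - 1):
        return np.array([mx, my], dtype=float)

    # Gradient D'(m) via central differences
    dx = (D[my, mx + 1] - D[my, mx - 1]) / 2
    dy = (D[my + 1, mx] - D[my - 1, mx]) / 2
    # Hessian D''(m) via second differences
    dxx = D[my, mx + 1] - 2 * D[my, mx] + D[my, mx - 1]
    dyy = D[my + 1, mx] - 2 * D[my, mx] + D[my - 1, mx]
    dxy = (D[my + 1, mx + 1] - D[my + 1, mx - 1]
           - D[my - 1, mx + 1] + D[my - 1, mx - 1]) / 4

    grad = np.array([dx, dy])
    hess = np.array([[dxx, dxy], [dxy, dyy]])
    if abs(np.linalg.det(hess)) < 1e-12:   # singular Hessian: skip refinement
        return np.array([mx, my], dtype=float)

    # mu = m - D''(m)^-1 D'(m)
    return np.array([mx, my]) - np.linalg.inv(hess) @ grad
```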
Finally, multiply the estimated joint coordinates by a constant to map them back to the original image size (this part is the same as the conventional method):
\hat{\mathbf{p}} = \lambda \, \mathbf{p}
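Putting the three sketch functions above together ($\lambda$ is the downsampling stride between heatmap and original image, assumed to be 4 here for illustration):

```python
# Decode a synthetic heatmap whose true (sub-pixel) peak is at (30.925, 21.55)
hm = gaussian_heatmap(np.array([30.925, 21.55]))
p = taylor_decode(modulate(hm))   # steps (a) + (b), in heatmap coordinates
p_hat = 4 * p                     # step (c): lambda = stride = 4
print(p_hat)                      # ~ [123.7, 86.2], recovering the original joint position
```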
Through the above three steps, the method succeeds in improving accuracy. ↓
In this post, we walked through **DARK**, a 2D pose estimation method from one of the latest papers of 2020.
DARK can be incorporated into various pose estimation models, so it may be worth plugging into whichever model you are using now. If you're interested, be sure to check out the paper and the GitHub repository.
**If you want to digitize and analyze people and objects**, or **if you want to build something interesting by combining 3D technology and deep learning**, we are looking for colleagues to work with.
If you are interested, please apply via the link below! https://www.wantedly.com/companies/sapeet