In 2019, a paper called **First Order Motion Model for Image Animation** was published. Its idea is to **make a still image move like a video**, using a still image and a video from the same category.
A picture is worth a thousand words, so here is what it can do.
The left is a **still image of Kasumi Arimura**, the center is a **video of President Trump**, and the right is a **video of Kasumi Arimura moving like President Trump**.
This is a schematic diagram of the model. It is roughly divided into two parts, the **Motion Module** and the **Generation Module**. The inputs are a still image (**Source**) and the frames of a video (**Driving Frame**).
The **Motion Module** computes, for the still image and for each frame of the video, the **keypoint positions** (for a face: eyes, nose, mouth, facial contour, etc.) and the **local gradients at those keypoints**, both expressed relative to an abstract **reference frame** (shown as R in the figure).
Then, from the keypoints and the **Jacobians** $J_k$ (which capture how the neighborhood of each keypoint is scaled between the still image and the video frame), it computes a **mapping function** that indicates **which pixel of the still image should be used to produce each pixel of each video frame**.
Because everything is expressed as a **difference from this conceptual reference frame**, rather than as a **direct difference between the still image and each video frame**, the still image and the video frames can be processed independently, and the transformation works well even when the two images differ greatly.
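As a rough sketch of the idea (following the paper's first-order Taylor expansion in simplified notation, where $S$ is the source image, $D$ a driving frame, $R$ the reference frame, and $p_k$ the $k$-th keypoint), the mapping from a driving frame back to the source near each keypoint is

$$\mathcal{T}_{S \leftarrow D}(z) \approx \mathcal{T}_{S \leftarrow R}(p_k) + J_k \bigl( z - \mathcal{T}_{D \leftarrow R}(p_k) \bigr), \qquad J_k = \left( \frac{d}{dp}\mathcal{T}_{S \leftarrow R}(p) \Big|_{p=p_k} \right) \left( \frac{d}{dp}\mathcal{T}_{D \leftarrow R}(p) \Big|_{p=p_k} \right)^{-1}$$

The keypoint positions and the two derivatives are each estimated from a single image, which is exactly why the source and the driving frames never have to be compared directly.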
Also, when the still image is animated, some regions were never visible in the still image in the first place; an **occlusion map** (shown as O) is used to handle those regions gracefully.
The **Generation Module** generates the output image from those results and the still image. The loss function is the **reconstruction error between the output image and the ground-truth frame**, plus a **regularization term** (a constraint that keeps the mapping function close to a simple transformation).
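Schematically, with $\hat{D}$ the generated frame and $D$ the ground-truth frame (a simplified summary, not the paper's exact formulation):

$$\mathcal{L} = \mathcal{L}_{\text{rec}}\bigl(\hat{D}, D\bigr) + \lambda\,\mathcal{L}_{\text{equivariance}}$$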
During training, the still image and the driving frames are sampled as arbitrary frames of the same video. At inference time, any **still image and video from the same category as the training data** can be used (this is the great thing about this model).
Trained weights are provided, so let's use them to generate a video from a still image and a video. Download the code and the sample data (weights, images, videos) in advance.
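For example, on Google Colab the preparation looks roughly like this (a minimal sketch: the repository is the official AliaksandrSiarohin/first-order-model listed in the references, but the `./sample/` folder used in the code below is an assumption; the pretrained checkpoints such as vox-cpk.pth.tar are downloaded separately from the links in that repository's README):

```python
# Minimal setup sketch for Google Colab (folder layout is an assumption).
# Clone the official implementation, then place the downloaded weights and the
# sample images/videos into ./sample/ to match the paths used below.
!git clone https://github.com/AliaksandrSiarohin/first-order-model.git
%cd first-order-model
```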
First, define a display() function that shows still images and videos on Google Colab.
import imageio
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from skimage.transform import resize
from IPython.display import HTML
import warnings
warnings.filterwarnings("ignore")
def display(source, driving, generated=None):
    fig = plt.figure(figsize=(8 + 4 * (generated is not None), 6))

    ims = []
    for i in range(len(driving)):
        cols = [source]
        cols.append(driving[i])
        if generated is not None:
            cols.append(generated[i])
        im = plt.imshow(np.concatenate(cols, axis=1), animated=True)
        plt.axis('off')
        ims.append([im])

    ani = animation.ArtistAnimation(fig, ims, interval=50, repeat_delay=1000)
    plt.close()
    return ani
The arguments are the still image, the driving video, and (optionally) the generated video; the driving and generated videos are lists of frames in **ndarray** format. For each frame, the still image, the driving frame, and the generated frame are **concatenated** side by side, and a **matplotlib** **animation** object is returned.
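As a quick usage example, once a source image and driving video have been loaded (as in the next block), the pair can also be previewed without a generated video (just a sketch):

```python
# Preview only the still image and the driving video side by side
# (assumes source_image and driving_video are loaded and resized as shown below).
HTML(display(source_image, driving_video).to_html5_video())
```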
Now let's generate a video in the face category. The still image and video can be a painting, a doll, or an anime character, as long as the facial keypoints (eyes, nose, mouth, facial contour, etc.) can be detected. Here, let's use the **Mona Lisa** as the still image and **President Trump** as the video.
from demo import load_checkpoints
from demo import make_animation
from skimage import img_as_ubyte
source_image = imageio.imread('./sample/05.png')
driving_video = imageio.mimread('./sample/04.mp4')
#Resize image and video to 256x256
source_image = resize(source_image, (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3] for frame in driving_video]
generator, kp_detector = load_checkpoints(config_path='config/vox-256.yaml',
                                          checkpoint_path='./sample/vox-cpk.pth.tar')
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)
HTML(display(source_image, driving_video, predictions).to_html5_video())
Great! **The Mona Lisa has started to move like President Trump.** It really is an anything-goes world.
This model also has a mode that matches the keypoint positions of the generated face to those of the driving video in absolute terms (that is, it fits the face shape to the driving video). If you try that,
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=False, adapt_movement_scale=True)
HTML(display(source_image, driving_video, predictions).to_html5_video())
The generated video is still basically the Mona Lisa, but because the keypoint positions are President Trump's, we end up in the baffling territory of a **"Mona Lisa who looks like President Trump"** (laugh).
The driving video can be anything as long as the face position is roughly right, so you can crop just the face out of any video you have and use that.
!ffmpeg -i ./sample/07.mkv -ss 00:08:57.50 -t 00:00:08 -filter:v "crop=600:600:760:50" -async 1 hinton.mp4
With **ffmpeg**, one line like this produces a cropped video. Briefly, the arguments are:
- **-i**: input video path (the code uses .mkv, but mp4 is also fine)
- **-ss**: start time of the crop (hours:minutes:seconds.hundredths)
- **-t**: length of the cropped video (hours:minutes:seconds)
- **-filter:v "crop=W:H:X:Y"**: width of the cropped image : height of the cropped image : horizontal position of its top-left corner : vertical position of its top-left corner
It's easy to use, so give it a try.
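The cropped clip can then be fed in as the driving video exactly like before (a sketch, assuming the source image, generator, and kp_detector loaded above are still in memory):

```python
# Sketch: use the freshly cropped hinton.mp4 as the driving video.
# Preprocessing is the same as before; generator and kp_detector are reused.
driving_video = imageio.mimread('hinton.mp4', memtest=False)
driving_video = [resize(frame, (256, 256))[..., :3] for frame in driving_video]

predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)
HTML(display(source_image, driving_video, predictions).to_html5_video())
```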
Any category works as long as the keypoints can be detected, so this time let's try the **whole-body category**.
We use weights trained on the **Fashion Video Dataset**, a dataset of fashion models in motion, with **body keypoint positions** (head, face, joints, etc.).
Let's make a still image of **Haru** move like a **fashion model** video.
from demo import load_checkpoints
from demo import make_animation
from skimage import img_as_ubyte
source_image = imageio.imread('./sample/fashion003.png')
driving_video = imageio.mimread('./sample/fashion01.mp4', memtest=False)
#Resize image and video to 256x256
source_image = resize(source_image, (256, 256))[..., :3]
driving_video = [resize(frame, (256, 256))[..., :3] for frame in driving_video]
generator, kp_detector = load_checkpoints(config_path='config/fashion-256.yaml',
                                          checkpoint_path='./sample/fashion.pth.tar')
predictions = make_animation(source_image, driving_video, generator, kp_detector, relative=True)
HTML(display(source_image, driving_video, predictions).to_html5_video())
The left is a **still image of Haru**, the center is a **video of a fashion model**, and the right is a **video of Haru moving like the fashion model**. The movement is smooth.
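If you also want to write the generated frames out to a file (presumably what the img_as_ubyte import is for), a minimal sketch looks like this (the output filename and fps are arbitrary choices):

```python
# Minimal sketch: save the generated frames as an mp4.
# img_as_ubyte converts the float frames back to 8-bit before writing.
imageio.mimsave('generated.mp4', [img_as_ubyte(frame) for frame in predictions], fps=30)
```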
I have created a **Google Colab** notebook that lets you try everything introduced so far; if you want to try it, click this **link** (/github/cedro3/first-order-model/blob/master/demo_run.ipynb) and run it.
(References)
・AliaksandrSiarohin/first-order-model
・[Paper reading] First Order Motion Model for Image Animation