Background
In a previous article, [OpenCV][C++] Detecting multiple objects with template matching, I used OpenCV's template matching to detect Goombas. I scanned the prepared Goomba image over the frame from top to bottom and computed a similarity score for each region, but the clouds at the top of the screen look similar to a Goomba, so when a cloud and a Goomba appeared together, the cloud was detected first. One option would be to roughly crop the high-similarity regions and then judge them against the template image using histograms or background subtraction, but this time I changed course and tried object detection with YOLOv3 instead.
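For reference, the template-matching approach looks roughly like this in Python (the original attempt was in C++; frame.png and goomba.png are placeholder file names):

import cv2

# Load a frame and the Goomba template (placeholder file names)
img = cv2.imread("frame.png")
template = cv2.imread("goomba.png")
h, w = template.shape[:2]

# Slide the template over the frame and score the similarity at each position
result = cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

# The single best match can be a cloud rather than a Goomba when the shapes are similar
cv2.rectangle(img, max_loc, (max_loc[0] + w, max_loc[1] + h), (0, 0, 255), 2)
cv2.imwrite("detected.png", img)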
Device
Environment
I needed to install the required libraries, and the official PyTorch instructions proceed on the assumption of Anaconda. Until now I had created virtual environments with python3 -m venv [envname] and installed the necessary packages into them with pip, but deep learning libraries tend to have many dependencies on other packages, so this time I will use Anaconda.
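For example, a dedicated environment could be created and activated like this (the environment name yolo is arbitrary, and Python 3.7 is just one version known to work with PyTorch 1.6):

conda create -n yolo python=3.7
conda activate yolo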
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch
My environment still has CUDA 10.0, but training worked without any problem even with cudatoolkit set to 10.1. Even a slight mismatch in the GPU-related settings can cause an error at run time, though, so strictly speaking it is better to match the versions.
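As a quick sanity check after installation, it is worth confirming that PyTorch can see the GPU and which CUDA version it was built against:

import torch

print(torch.__version__)          # should print 1.6.0
print(torch.version.cuda)         # the CUDA version PyTorch was built with
print(torch.cuda.is_available())  # True if the GPU can be used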
PyTorch + YOLOv3
Use PyTorch-YOLOv3. Clone the repository with git, then download the weight files and the dataset. The dataset is COCO, so it is quite heavy.
git clone https://github.com/eriklindernoren/PyTorch-YOLOv3
cd PyTorch-YOLOv3/
sudo pip3 install -r requirements.txt
cd weights/
bash download_weights.sh
cd ../data/
bash get_coco_dataset.sh
Then check if it works.
python3 test.py --weights_path weights/yolov3.weights
If that runs without problems, check whether objects can be detected with the default settings. Put the images you want to detect into data/samples and run detect.py.
python3 detect.py --image_folder data/samples/
An output folder is created directly under the working directory; if you look inside, the images marked up with the detection results are stored there.
Annotation
By default there is a trained weight file that uses COCO as its dataset and can detect 80 kinds of objects. This section describes how to train the model when there is an object you personally want to detect.
The overall flow: prepare around 100 images containing the target objects, annotate (label) which object appears where in each image, and finally run train.py to train.
The dataset uses the video of Super Mario Bros (NES) Level 1-1 mentioned in Background. I wondered what data to use for training, and one conclusion is that game footage (copyright permitting) is a good choice. Unlike images shot in the real world, a game's characters are fixed sprites that appear repeatedly with only a few shape variations, so I think it is easy to get usable results quickly. Also, since it is video, roughly 30 to 60 frames can be captured per second, so images are easy to collect.
So I annotate with the labelme tool. My work PC is a Mac, so I install Qt5 with Homebrew and labelme itself with pip.
brew install pyqt
pip install labelme
Before starting labelme, prepare the training data. This time I extracted seconds 23-28 of the Super Mario Bros (NES) Level 1-1 video and converted them into sequentially numbered images as follows.
wget https://raw.githubusercontent.com/wkentaro/dotfiles/f3c5ad1f47834818d4f123c36ed59a5943709518/local/bin/video_to_images
pip install imageio imageio-ffmpeg tqdm
python video_to_images your_video.mp4
When it runs, a your_video folder is created and the numbered images are stored in it.
Now that the training data is ready, specify the folder path and start labelme.
labelme ./your_video
After that, specify a label and a region for each object. There were about 150 images. When training YOLOv3 you only need to mark a rectangular bounding box, but since I may later work with a DNN trained for segmentation, I went out of my way to trace the full region :sweat_smile:
The labels and regions attached to the objects are generated as JSON files in the same folder the images were loaded from.
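For reference, each labelme JSON file looks roughly like this, trimmed to the fields that the conversion script below relies on (the values are illustrative):

{
  "shapes": [
    {
      "label": "mario",
      "points": [[10.0, 20.0], [14.0, 20.0], [14.0, 36.0], [10.0, 36.0]]
    }
  ],
  "imagePath": "00000000.jpg",
  "imageHeight": 240,
  "imageWidth": 256
}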
When you are done annotating, pack the folder into a compressed archive.
tar -zcvf output.tar.gz ./your_video
Training
Now that the training data is ready, let's try machine learning.
First, create the config file. Counting __ignore__, the number of classes here is 3, so run the script with <num-classes> = 3. After execution, config/yolov3-custom.cfg is output.
cd config/
bash create_custom_model.sh <num-classes>
Open config/custom.data and set the number of classes.
config/custom.data
classes=3
train=data/custom/train.txt
valid=data/custom/valid.txt
names=data/custom/classes.names
Open data/custom/classes.names and list the object names, one per line. By default only train is written there.
data/custom/classes.names
__ignore__
mario
kuribo
Next, place the compressed training-data archive directly under the repository root and extract it.
tar -zxvf output.tar.gz
Move the images from the extracted folder to data/custom/images/. By default there is a train image in data/custom/images/, so delete it first.
rm data/custom/images/train.jpg
mv ./your_video/*.jpg data/custom/images/
Next, based on the JSON files that hold the label information, rewrite each annotation as [class ID] [object center x] [object center y] [object width] [object height]. The values are output not as raw pixel coordinates but as ratios [0, 1.0] of the whole image. The rewritten files go into data/custom/labels/ (a driver sketch follows the conversion function below).
import os
import json
import numpy as np

def treat(filepath, classes):
    with open(filepath, "r") as fin:
        src = json.load(fin)
    dst = []
    for item in src["shapes"]:
        txt = item["label"]
        # Take the mean of the polygon points as the object center
        cx, cy = np.mean(np.array(item["points"]), axis=0)
        # Convert to a ratio with the full image size taken as 1.0
        cx_norm = cx / src["imageWidth"]
        cy_norm = cy / src["imageHeight"]
        # Width and height of the object, also as ratios
        min_x, min_y = np.min(np.array(item["points"]), axis=0)
        max_x, max_y = np.max(np.array(item["points"]), axis=0)
        rect_width = (max_x - min_x) / src["imageWidth"]
        rect_height = (max_y - min_y) / src["imageHeight"]
        # Look up the class ID from the label name
        idx = list(filter(lambda x: x[1] == txt, classes))[0][0]
        # One row per object: [class ID, cx, cy, width, height]
        dst.append([idx, cx_norm, cy_norm, rect_width, rect_height])
    return dst
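The function above only returns the converted rows, so a little driver code is needed to actually write the label files. A minimal sketch of my own (not in the original script), continuing the code above and assuming the labelme JSON files were moved into data/custom/images/ alongside the images:

# Class IDs in the same order as data/custom/classes.names
classes = [(0, "__ignore__"), (1, "mario"), (2, "kuribo")]

src_dir = "data/custom/images"
dst_dir = "data/custom/labels"
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    if not name.endswith(".json"):
        continue
    rows = treat(os.path.join(src_dir, name), classes)
    txt_path = os.path.join(dst_dir, os.path.splitext(name)[0] + ".txt")
    with open(txt_path, "w") as fout:
        for row in rows:
            fout.write(" ".join(map(str, row)) + "\n")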
Finally, write the paths of the files stored in data/custom/images/ into data/custom/train.txt (for training) and data/custom/valid.txt (for validation). A split of about (training) : (validation) = 8 : 2 feels right; a script for generating it follows the listing below.
train.txt
data/custom/images/00000000.jpg
data/custom/images/00000001.jpg
data/custom/images/00000002.jpg
data/custom/images/00000003.jpg
data/custom/images/00000004.jpg
data/custom/images/00000005.jpg
data/custom/images/00000006.jpg
data/custom/images/00000007.jpg
data/custom/images/00000008.jpg
data/custom/images/00000009.jpg
data/custom/images/00000010.jpg
data/custom/images/00000011.jpg
data/custom/images/00000012.jpg
data/custom/images/00000013.jpg
data/custom/images/00000014.jpg
data/custom/images/00000015.jpg
...
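Writing these lists by hand is tedious, so here is a small sketch of my own (not part of the repository) that generates the 8 : 2 split:

import os
import random

image_dir = "data/custom/images"
paths = sorted(
    os.path.join(image_dir, f) for f in os.listdir(image_dir) if f.endswith(".jpg")
)
random.shuffle(paths)

# 8 : 2 split between training and validation
n_train = int(len(paths) * 0.8)
with open("data/custom/train.txt", "w") as f:
    f.write("\n".join(paths[:n_train]) + "\n")
with open("data/custom/valid.txt", "w") as f:
    f.write("\n".join(paths[n_train:]) + "\n")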
Now that the settings are complete, run train.py.
python3 train.py \
--model_def config/yolov3-custom.cfg \
--data_config config/custom.data \
--batch_size 2 \
--img_size 32 \
--epochs 200 \
--pretrained_weights weights/darknet53.conv.74
The default batch_size is 8 and the default img_size is 416, but on a machine with weak specs you will hit memory errors. Mine failed because the GPU has only 4 GB of memory. In that case, lowering these values lets training proceed normally.
As for the training results, a checkpoint yolov3_ckpt_{epoch}.pth is written to checkpoints/ after each epoch.
Detect Objects
First, prepare the image data for object detection. Here I use the video frames from the start to the goal of Super Mario Bros (NES) Level 1-1. To convert the video into numbered images, use the video_to_images script in the same way as when creating the training data. I obtained 1501 images in total.
Now, let's try detection using detect.py.
python3 detect.py --image_folder ./data/mario_1-1/ \
--weights_path ./checkpoints/yolov3_ckpt_199.pth \
--model_def config/yolov3-custom.cfg \
--class_path data/custom/classes.names
Specify the folder containing the images you want to test with --image_folder, and pass the checkpoint file generated by training to --weights_path. The results are stored in the output folder.
(239) Image: './data/mario_1-1/00000239.jpg'
+ Label: mario, Conf: 0.99997
...
Finally, convert the numbered images back into a video and you're done.
At first I converted them with ffmpeg as ffmpeg -r 30 -i %08d.png -vcodec libx264 -pix_fmt yuv420p -r 60 out.mp4, but the image quality degraded badly.
(See: Generating a video from numbered images with ffmpeg / generating numbered images from a video — preventing dropped frames.)
So I converted it to a video using OpenCV.
import cv2
import os

def main():
    # Collect the PNG files in the current directory in name order
    is_png = lambda x: os.path.splitext(x)[1] == ".png"
    imgs = list(filter(is_png, os.listdir()))
    imgs.sort()

    width = 480
    height = 270
    fps = 30

    fmt = cv2.VideoWriter_fourcc('m', 'p', '4', 'v')
    writer = cv2.VideoWriter('output.mp4', fmt, fps, (width, height))

    # Resize each frame and append it to the video
    for img in imgs:
        mat = cv2.imread(img)
        dst = cv2.resize(mat, dsize=(width, height))
        writer.write(dst)

    writer.release()

if __name__ == "__main__":
    main()
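Note that the script simply picks up the PNG files in the current directory, so run it inside the output folder that detect.py generated.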
Result
↓ Click to watch a YouTube video of the detection running from the start to the goal.
- Fire Mario was not in the training data, but it is identified as Mario, presumably because it looks similar.
- Small Mario was not trained either, and is sometimes mistaken for a Goomba, probably because of its similar height.
- I think accuracy would improve further if objects such as blocks, treasure chests, and Koopa Troopas were also targeted.