Identify what appears at which position in which frame of a video. This article uses Google's Video Intelligence API to detect objects in videos. The code presented here is based on the official Getting Started guide.
Follow [Authenticating to the API](https://cloud.google.com/video-intelligence/docs/how-to?hl=ja) in the official Video Intelligence API guide and obtain a service account key file.
Google Colaboratory is used to implement the code and check the results.
Upload the service account key file to Colaboratory as described in [this article](https://qiita.com/sosuke/items/533909d31244f986ad47). If you don't want to do this every time, you can keep the file in a mounted Google Drive, but be careful not to share it by accident.
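If you prefer to upload the key file from a notebook cell rather than the file pane, a minimal sketch using Colaboratory's upload helper looks like this (the key file name is just a placeholder):

```python
# Upload the service account key file into the Colaboratory runtime
from google.colab import files

uploaded = files.upload()  # select the key file (e.g. your-key.json) in the dialog
print(list(uploaded.keys()))  # the uploaded file name can then be used as the key path
```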
The Video Intelligence API can analyze:

- Video files saved in GCP Storage
- Video files saved locally

If you use a video file saved in GCP Storage, a Storage usage fee is charged in addition to the API fee, so local files are recommended if you just want to try it out.
This time we will analyze a locally saved video file, so save the video you want to analyze in Google Drive and mount the drive in Colaboratory. The drive can be mounted from the left pane of Colaboratory, or from code as in the sketch below.
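A minimal sketch of mounting Drive from code (the mount point `/content/drive` is the usual default):

```python
# Mount Google Drive into the Colaboratory runtime
from google.colab import drive

drive.mount('/content/drive')
# Videos saved in My Drive then appear under /content/drive/MyDrive/
```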
When preparing a video, keep the usage fees of the Video Intelligence API in mind.
The Video Intelligence API is billed according to the length of the video being analyzed. Length is counted in minutes, with anything under one minute rounded up, so, for example, annotating a 10-second, a 59-second, or a 60-second video all costs the same.
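For reference, a minimal sketch of this rounding rule (the helper `billed_minutes` is just an illustration, not part of the API):

```python
import math

def billed_minutes(duration_seconds):
    # Durations are rounded up to the next full minute for billing
    return math.ceil(duration_seconds / 60)

# A 10-second, a 59-second and a 60-second video are all billed as 1 minute
for sec in (10, 59, 60):
    print(sec, 'sec ->', billed_minutes(sec), 'minute(s)')
```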
The price of each annotation feature is as follows:

Feature | First 1,000 minutes | Over 1,000 minutes |
---|---|---|
Label detection | Free | $0.10/min |
Shot detection | Free | $0.05/min, free when used with label detection |
Explicit content detection | Free | $0.10/min |
Speech transcription | Free | $0.048/min (charged only for en-US among the supported languages) |
Object tracking | Free | $0.15/min |
Text detection | Free | $0.15/min |
Logo detection | Free | $0.15/min |
Celebrity recognition | Free | $0.10/min |
Install the client library:

```
!pip install -U google-cloud-videointelligence
```
First, authenticate with the service account key file obtained in [Authenticating to the API](https://cloud.google.com/video-intelligence/docs/how-to?hl=ja).
`service_account_key_name` is the path to the service account key file uploaded to Colaboratory.
```python
import json
from google.cloud import videointelligence
from google.oauth2 import service_account

# API authentication
service_account_key_name = "{YOUR KEY.json}"
info = json.load(open(service_account_key_name))
creds = service_account.Credentials.from_service_account_info(info)

# Create the client
video_client = videointelligence.VideoIntelligenceServiceClient(credentials=creds)
```
Next, load the video from Drive.
```python
# Specify the video to be processed and load it
import io

path = '{YOUR FILE PATH}'
with io.open(path, 'rb') as file:
    input_content = file.read()
```
Then run the API and get the result.
```python
features = [videointelligence.enums.Feature.OBJECT_TRACKING]
timeout = 300
operation = video_client.annotate_video(input_content=input_content, features=features, location_id='us-east1')
print('\nProcessing video for object annotations.')
result = operation.result(timeout=timeout)
print('\nFinished processing.\n')
```
Jupyter notebooks render pandas DataFrames nicely, so extract only the necessary information from the [response](https://cloud.google.com/video-intelligence/docs/object-tracking?hl=ja) and build a DataFrame.
This time, take the following fields from `object_annotations` in the response.
Column name | Contents | Source |
---|---|---|
Description | Object description (name) | entity.description |
Confidence | Detection reliability | confidence |
SegmentStartTime | Start time of the segment in which the object appears | segment.start_time_offset |
SegmentEndTime | End time of the segment in which the object appears | segment.end_time_offset |
FrameTime | Time, in seconds from the start of the video, of the frame in which the object was detected | frames[i].time_offset |
Box{XXX} | Coordinates of each side of the object's bounding box, normalized to the range 0–1 | frames[i].normalized_bounding_box |
```python
# List the detected objects
import pandas as pd

columns = ['Description', 'Confidence', 'SegmentStartTime', 'SegmentEndTime', 'FrameTime',
           'BoxLeft', 'BoxTop', 'BoxRight', 'BoxBottom', 'Box', 'Id']
object_annotations = result.annotation_results[0].object_annotations
result_table = []
for object_annotation in object_annotations:
    for frame in object_annotation.frames:
        box = frame.normalized_bounding_box
        result_table.append([
            object_annotation.entity.description,
            object_annotation.confidence,
            object_annotation.segment.start_time_offset.seconds + object_annotation.segment.start_time_offset.nanos / 1e9,
            object_annotation.segment.end_time_offset.seconds + object_annotation.segment.end_time_offset.nanos / 1e9,
            frame.time_offset.seconds + frame.time_offset.nanos / 1e9,
            box.left,
            box.top,
            box.right,
            box.bottom,
            [box.left, box.top, box.right, box.bottom],
            object_annotation.entity.entity_id
        ])
        # Since the output would be huge, keep only the first frame of each segment for now
        break

df = pd.DataFrame(result_table, columns=columns)
pd.set_option('display.max_rows', len(result_table))
# Sort and display by Confidence
df.sort_values('Confidence', ascending=False)
```
When executed, the following results will be obtained.
First, frames are extracted from the video based on the `time_offset` values above. Use OpenCV to capture still images from the video. Since the frame to capture has to be specified by frame number, the approximate frame number is calculated from the video's FPS and `time_offset` (in seconds).
```python
import cv2

images = []
cap = cv2.VideoCapture(path)
if cap.isOpened():
    fps = cap.get(cv2.CAP_PROP_FPS)
    for sec in df['FrameTime']:
        # Calculate the frame number from fps and the elapsed seconds
        cap.set(cv2.CAP_PROP_POS_FRAMES, round(fps * sec))
        ret, frame = cap.read()
        if ret:
            images.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
```
Next, draw a box around the object detected in each extracted still image using OpenCV's `rectangle`. Since `rectangle` takes the upper-left and lower-right vertices of the rectangle to draw, these two points have to be computed. The `normalized_bounding_box` returned by the API contains values for the four sides (`left`, `top`, `right`, `bottom`). For example, the box shown below indicates the position of a `person`. The value of `left` is `l (distance from the left edge of the image to the left edge of the box) / width (width of the whole image)`, so the x coordinate of vertex 1 (`pt1`) can be recovered from `left` as `width * left`.

Prepare helper methods for this.
```python
import math

# Convert a normalized ratio to a pixel coordinate on the image
def ratio_to_pics(size_pics, ratio):
    return math.ceil(size_pics * ratio)

# Get the top-left and bottom-right vertices from the box
def rect_vertex(image, box):
    height, width = image.shape[:2]
    return [
        (
            ratio_to_pics(width, box[0]), ratio_to_pics(height, box[1])
        ),
        (
            ratio_to_pics(width, box[2]), ratio_to_pics(height, box[3])
        )
    ]
```
Using the methods above to calculate the vertex positions, draw the boxes onto the images.
```python
boxed_images = []
color = (0, 255, 255)
thickness = 5
for index, row in df.iterrows():
    image = images[index]
    boxed_images.append(cv2.rectangle(image, *rect_vertex(image, row.Box), color, thickness=thickness))
```
Finally, display each image together with its Description and Confidence. Depending on the length of the video and the number of detected objects, displaying everything can take a while, so a threshold is set on Confidence.
```python
import math
import matplotlib.pyplot as plt

# Cut off at an appropriate confidence
min_confidence = 0.7

# Set up the figure
col_count = 4
row_count = math.ceil(len(images) / col_count)
fig = plt.figure(figsize=(col_count * 4, row_count * 3), dpi=100)
num = 0

# Display the still images side by side
for index, row in df.iterrows():
    if row.Confidence < min_confidence:
        continue
    num += 1
    fig.add_subplot(row_count, col_count, num, title='%s : (%s%s)' % (row.Description, round(row.Confidence * 100, 2), '%'))
    plt.imshow(boxed_images[index], cmap='gray')
    plt.axis('off')
```
When executed, the following results will be obtained.
[Displaying the results of video analysis using the Cloud Video Intelligence API from Colaboratory](https://qiita.com/sosuke/items/533909d31244f986ad47)
Official Getting Started Guide