[PYTHON] TFRecord file creation memorandum for object detection

Introduction

In the previous article (https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b), I summarized how to use the Object Detection API provided by TensorFlow. To teach the API that "this object is XX" from images and videos, you annotate the data and convert it to TFRecord format for use as training and validation data, but there still seem to be few articles that describe what this TFRecord format actually contains.

In this article, I will cover the topic from several angles, starting with a method that requires no programming and moving on to writing the Python code yourself. Of course, these are just findings from my own trial and error, so please note that some points may fall short.

Contents to introduce

- Procedure for creating a TFRecord file with Microsoft's VoTT
- What kind of file is TFRecord?
- How to cut out the annotated regions from the annotated data
- Try creating a TFRecord for object detection in Python

Procedure for creating a TFRecord file with VoTT

VoTT installation

Download it from the following site. Select the .exe if your OS is Windows, or the .dmg if it is Mac. VoTT

Initial setting

First, create a new project with "New Project". q01.png

Set the project.

VoTT saves directory settings, such as folder locations, as connections created with "Add Connection". Each "~ Connection" field refers to one of these saved setting names, so when using VoTT for the first time, start by creating a connection with "Add Connection" on the right side. q02.png

Set the directory.

Finally, select "Save Connection". q03.png

After setting "Source Connection" and "Target Connection" respectively, the next step is annotation work. q04.png

Annotation

Annotation is done with the second icon from the top on the left (below the house icon). (The photo shows my cats, Mimmy and Kitty.) q05.png

First, set up tags. On the right side there is a TAGS heading with a + icon next to it; select it to create a new tag. This time I set my cats' names, "Mimmy" and "Kitty".

Then select the second icon (the square) from the left at the top, and drag to enclose the region you want to annotate. q06.png

The box may appear as a gray rectangle when you first draw it. If you want a particular tag applied the moment you draw a box, select the tag name on the right side and then select the lock (key) icon to pin that tag; subsequent annotations will be given that tag name automatically. (Alternatively, on Mac you can get the same behavior by clicking the tag name while holding the Command key. Perhaps the Ctrl key on Windows, but I haven't confirmed.)

You can also constrain the annotation to a square by holding the Shift key while drawing. It's a good idea to get used to these controls by experimenting.

Export TFRecord and json

To generate a TFRecord, it must be configured in advance in the export settings. Select the fourth icon (the arrow) from the top in the menu on the left.

After completing the settings, select "Save Export Settings" to save them.

q07.png

After that, return to the annotation screen, save the project with the floppy-disk icon at the top right, and export in the configured format (TFRecord this time) with the arrow icon at the top right. Incidentally, the json files are created automatically when you save the project, even if you don't configure anything.

q08.png

The above is the procedure for creating a TFRecord file using VoTT.

Next comes the main subject of this article.

What kind of file is TFRecord?

What is TFRecord?

What is TFRecord in the first place?

Here's an excerpt from the official TensorFlow tutorial:

The TFRecord format is a simple format for storing a sequence of binary records. Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Reference: Usage of TFRecords and tf.Example

Reading this alone doesn't give a clear picture. So let's look at the contents of the TFRecord exported from VoTT earlier.

Read the TFRecord file and try to visualize it

The contents of a TFRecord can be visualized with the following code.

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import IPython.display as display

#Specify the path of the TFRecord
filenames = 'VoTT/Cat/Cat-TFRecords-export/Mimmy_and_Kitty.tfrecord'
raw_dataset = tf.data.TFRecordDataset(filenames)

#Export the parsed content to another file
# (.txt also works; with .json, many editors add syntax coloring, so it is easier to read than .txt)
tfr_data = 'tfr.json'

for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

    #Write to a file. Not required, since you can also view it on the console without exporting.
    with open(tfr_data, 'w') as f:
        print(example, file=f)

Let's take a look at the file exported from the above source.

features {
  feature {
    key: "image/encoded"
    value {
      bytes_list {
        value: "\377\330\377...
        .....(Since it is a large amount, it is omitted)..."
      }
    }
  }
  feature {
    key: "image/filename"
    value {
      bytes_list {
        value: "Mimmy_and_Kitty.jpg "
      }
    }
  }
  feature {
    key: "image/format"
    value {
      bytes_list {
        value: "jpg"
      }
    }
  }
  feature {
    key: "image/height"
    value {
      int64_list {
        value: 1440
      }
    }
  }
  feature {
    key: "image/key/sha256"
    value {
      bytes_list {
        value: "TqXFCKZWbnYkBUP4/rBv1Fd3e+OVScQBZDav2mXSMw4="
      }
    }
  }
  feature {
    key: "image/object/bbox/xmax"
    value {
      float_list {
        value: 0.48301976919174194
        value: 0.7260425686836243
      }
    }
  }
  feature {
    key: "image/object/bbox/xmin"
    value {
      float_list {
        value: 0.3009025752544403
        value: 0.5285395383834839
      }
    }
  }
  feature {
    key: "image/object/bbox/ymax"
    value {
      float_list {
        value: 0.6981713175773621
        value: 0.8886410593986511
      }
    }
  }
  feature {
    key: "image/object/bbox/ymin"
    value {
      float_list {
        value: 0.3555919826030731
        value: 0.5664308667182922
      }
    }
  }
  feature {
    key: "image/object/class/label"
    value {
      int64_list {
        value: 0
        value: 1
      }
    }
  }
  feature {
    key: "image/object/class/text"
    value {
      bytes_list {
        value: "Mimmy"
        value: "Kitty"
      }
    }
  }
  feature {
    key: "image/width"
    value {
      int64_list {
        value: 2560
      }
    }
  }
}

Other keys such as difficult, truncated, view, and source_id are also included, but here I have extracted only the parts I consider necessary. Looking through it, you can see that it has the structure shown above.

At this point, you probably have a good idea of what kind of structure a TFRecord is made of.
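Once an Example has been parsed like this, individual features can also be read programmatically. Below is a minimal sketch (the feature keys are those from the dump above, and the parsing is the same as in the earlier listing):

import tensorflow as tf

raw_dataset = tf.data.TFRecordDataset('VoTT/Cat/Cat-TFRecords-export/Mimmy_and_Kitty.tfrecord')
for raw_record in raw_dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    feature = example.features.feature

    #Scalar values are stored as single-element lists
    width = feature['image/width'].int64_list.value[0]
    height = feature['image/height'].int64_list.value[0]

    #Box coordinates and labels line up index by index, one entry per annotated object
    xmins = list(feature['image/object/bbox/xmin'].float_list.value)
    labels = [t.decode('utf8') for t in feature['image/object/class/text'].bytes_list.value]
    print(width, height, xmins, labels)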

How to cut out the annotated regions from the annotated data

The following is a method for programmatically cutting out the regions annotated with VoTT. If you can do this, it will come in handy when you write your own program to create TFRecord files and need, for example, to **composite the object you want to detect onto a background**, as described later. It is also useful for preparing image-classification training data.

Also, one thing I noticed after trying this: **the orientation of the image as displayed in VoTT and the orientation of the image when actually cropping can differ**.

In other words, the image you see on screen while annotating in VoTT was sometimes rotated 180 degrees relative to the original file that the json information was applied to when cropping.

As a result, one image was cropped at an unintended location. I'm not sure whether such images end up annotated in the correct orientation inside the TFRecord, so it may be safest to check them once.
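One plausible cause of this kind of viewer/loader mismatch is the EXIF orientation tag, which some tools honor and others ignore; note that this is my assumption, not something I confirmed for VoTT. As a sketch, Pillow can apply the tag before cropping so that the pixel data matches what was displayed:

from PIL import Image, ImageOps

#Hypothetical helper (assumes Pillow is installed): load an image with its
#EXIF orientation applied so crop coordinates match what the tool displayed
def load_normalized(path):
    img = Image.open(path)
    return ImageOps.exif_transpose(img)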

The introduction has grown long, so let's get right to checking the json used for image cropping. As I mentioned earlier, json is also exported automatically when you finish the annotation work in VoTT and export.

The cat data given in the example was written out as follows.

{
    "asset": {
        "format": "jpg",
        "id": "1da8e6914e4ec2e2c2e82694f19d03d5",
        "name": "Mimmy_and_Kitty.jpg ",
        "path": "【Folder name】/VoTT/Cat/IMAGES/Mimmy_and_Kitty.jpg ",
        "size": {
            "width": 2560,
            "height": 1440
        },
        "state": 2,
        "type": 1
    },
    "regions": [
        {
            "id": "kFskTbQ6Z",
            "type": "RECTANGLE",
            "tags": [
                "Mimmy"
            ],
            "boundingBox": {
                "height": 493.3142744479496,
                "width": 466.2200532386868,
                "left": 770.3105590062112,
                "top": 512.0524447949527
            },
            "points": [
                {
                    "x": 770.3105590062112,
                    "y": 512.0524447949527
                },
                {
                    "x": 1236.5306122448978,
                    "y": 512.0524447949527
                },
                {
                    "x": 1236.5306122448978,
                    "y": 1005.3667192429023
                },
                {
                    "x": 770.3105590062112,
                    "y": 1005.3667192429023
                }
            ]
        },
    ],
    "version": "2.1.0"
}

(The information for the other tag, Kitty, is omitted to keep this short.)

As you can see, it contains the file name and image size, as well as the annotated coordinate information. Cropping an image with just this information is entirely feasible.

Conveniently, the annotated coordinates are included in two forms, boundingBox and points. There are various possible approaches, but this time I read the boundingBox and cropped with it. The source code is below.

import json
import os
import fnmatch
import cv2 as cv

JSON_DIR = 'VoTT/Cat/'
IMG_DIR = 'VoTT/Cat/'
CUT_IMAGE = 'cut_images/'
CUT_IMAGE_NAME = 'cat'
IMAGE_FORMAT = '.jpg'

class Check():

    def filepath_checker(self, dir):
        
        if not (os.path.exists(dir)):
            print('No such directory > ' + dir)
            exit()

    def directory_init(self, dir):

        if not(os.path.exists(dir)) :
            os.makedirs(dir, exist_ok=True)

def main():
    
    check = Check()

    #Check if the directory containing the json file exists
    check.filepath_checker(JSON_DIR)
    
    #Prepare a storage location for the cropped images
    check.directory_init(CUT_IMAGE)

    #Parse the json and crop using the image and the annotation coordinates
    count = 0
    for jsonName in fnmatch.filter(os.listdir(JSON_DIR), '*.json'):

        #open json
        with open(JSON_DIR + jsonName) as f :
            result = json.load(f)

            #Get image file name
            imgName = result['asset']['name']
            print('jsonName = {}, imgName = {} '.format(jsonName, imgName))
            
            img = cv.imread(IMG_DIR + imgName)
            if img is None:
                print('cv.imread Error')
                exit()

            #Loop once per annotated region
            for region in result['regions'] :
                
                height = int(region['boundingBox']['height'])
                width = int(region['boundingBox']['width'])
                left = int(region['boundingBox']['left'])
                top = int(region['boundingBox']['top'])

                #Skip regions where you accidentally clicked a single point during annotation
                if height == 0 or width == 0:
                    print('<height or width is 0>  imgName = ', imgName)
                    continue

                cutImage = img[top: top + height, left: left + width]
                
                #Uncomment if you want to resize before exporting
                #cutImage = cv.resize(cutImage, (300,300))

                #Export files with serial numbers such as cut_images/cat0001.jpg
                cv.imwrite(CUT_IMAGE + CUT_IMAGE_NAME + "{0:04d}".format(count + 1) + IMAGE_FORMAT, cutImage)
                print("{0:04d}".format(count+1))
                count += 1

if __name__ == "__main__":
    main()
    

There is a conditional branch `if height == 0 or width == 0` in the code. A point accidentally clicked during annotation in VoTT remains in the data, and because there is no region to crop, it caused an error; I added the check to guard against that human error. q09.png

In my case, I needed to annotate many regions per image, so such mistakes were increasingly hard to notice, all the more so with a large amount of image data.
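Incidentally, since the json carries the coordinates in two forms, the same crop box can also be derived from points instead of boundingBox. A minimal sketch, assuming rectangular regions whose four corners are listed as in the json above (bbox_from_points is a hypothetical helper name):

def bbox_from_points(points):
    #Derive left/top/width/height from the four corner points
    xs = [p['x'] for p in points]
    ys = [p['y'] for p in points]
    left, top = int(min(xs)), int(min(ys))
    width = int(max(xs) - min(xs))
    height = int(max(ys) - min(ys))
    return left, top, width, height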

It has taken a while to get here, but now let's write a program that creates a TFRecord.

Create a TFRecord for object detection in Python

Now that you have a rough idea of how the contents of a TFRecord are structured, let's finally write the source code and generate one.

Source code description

The code posted this time does the following.

  1. Prepare in advance an image of the object (a cat) to composite onto the background image.
  2. Composite the background image and the object image (at a fixed, arbitrary position this time).
  3. Organize the information required for the TFRecord, such as the coordinates where the object was composited.
  4. Generate a TFRecord file.

First, the background image, borrowed from a stock photo site. Here it is: bg.jpg

And here is the object image to composite: Mimmy_image.png

Source code sample

Below is the source code.

import tensorflow as tf
import cv2 as cv
import utils.dataset_util as dataset_util


def img_composition(bg, obj, left, top):
    """
    Function that composites an object image onto a background image
    ----------
    bg : numpy.ndarray ~ background image
    obj : numpy.ndarray ~ object image (with alpha channel)
    left : int ~ x coordinate to composite at (left edge)
    top : int ~ y coordinate to composite at (top edge)
    """
    bg_img = bg.copy()
    obj_img = obj.copy()

    bg_h, bg_w = bg_img.shape[:2]
    obj_h, obj_w = obj_img.shape[:2]
 
    roi = bg_img[top:top + obj_h, left:left + obj_w]
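    #Use the object's alpha channel as a compositing mask: clear the object's
    #footprint in the background ROI, keep only the object's opaque pixels,
    #and add the two together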
    mask = obj_img[:, :, 3]

    ret, mask_inv = cv.threshold(cv.bitwise_not(mask), 200, 255, cv.THRESH_BINARY)

    img1_bg = cv.bitwise_and(roi, roi, mask=mask_inv)
    img2_obj = cv.bitwise_and(obj_img, obj_img, mask=mask)
    dst = cv.add(img1_bg, img2_obj)

    bg_img[top: obj_h + top, left: obj_w + left] = dst
    
    return bg_img


def set_feature(image_string, label, label_txt, xmins, xmaxs, ymins, ymaxs):
    """
    Function that sets the information to be written to the TFRecord
    To use this function, you need to bring in the "utils" library from the
    "object_detection" directory of the TensorFlow Object Detection API
    ----------
    image_string : bytes ~ encoded image data after compositing
    label : list ~ annotated tag numbers
    label_txt : list ~ annotated tag names
    xmins, xmaxs, ymins, ymaxs : list ~ annotated coordinates expressed as values from 0.0 to 1.0
    """
    image_shape = tf.io.decode_jpeg(image_string).shape

    feature = {
        'image/encoded': dataset_util.bytes_feature(image_string),
        'image/format': dataset_util.bytes_feature('jpg'.encode('utf8')),
        'image/height': dataset_util.int64_feature(image_shape[0]),
        'image/width': dataset_util.int64_feature(image_shape[1]),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),

        #If each image carries only a single tag, you can use these commented-out functions (match the types)
        # 'image/object/class/label': dataset_util.int64_feature(label),
        # 'image/object/class/text': dataset_util.bytes_feature(LABEL.encode('utf8')),

        #To attach two or more tags to one image, use the functions whose names contain "_list_"
        #They also work with a single tag, so these are the ones I recommend in general
        'image/object/class/label': dataset_util.int64_list_feature(label),
        'image/object/class/text': dataset_util.bytes_list_feature(label_txt),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

def main():

    #Each file path name
    bg_image_path = './comp/bg.jpg'
    obj_img_path = './comp/Mimmy_image.png'
    comp_img_path = './comp/img_comp.jpg'
    tfr_filename = './mimmy.tfrecord'

    #Lists for the TFRecord
    tag = {'Mimmy': 0, 'Kitty': 1, 'Mimelo': 2}
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    class_label_list = []
    class_text_list = []
    datas = {}

    #Label name setting
    class_label = tag['Mimmy']

    #Loading background
    bg_img = cv.imread(bg_image_path, -1)
    bg_img = cv.cvtColor(bg_img, cv.COLOR_RGB2RGBA)
    bg_h, bg_w = bg_img.shape[:2]

    #Loading an object
    obj_img = cv.imread(obj_img_path, -1)
    #Add an alpha channel only if the PNG does not already include one
    if obj_img.shape[2] == 3:
        obj_img = cv.cvtColor(obj_img, cv.COLOR_RGB2RGBA)
    scale = 250 / obj_img.shape[1]
    obj_img = cv.resize(obj_img, dsize=None, fx=scale, fy=scale)
    obj_h, obj_w = obj_img.shape[:2]

    #Combine background and object
    x = int(bg_w * 0.45) - int(obj_w / 2)
    y = int(bg_h * 0.89) - int(obj_h / 2)
    comp_img = img_composition(bg_img, obj_img, x, y)
    
    #Exporting a composite image
    cv.imwrite(comp_img_path, comp_img)

    #Append the normalized coordinates to the TFRecord lists
    xmins.append(x / bg_w)
    xmaxs.append((x + obj_w) / bg_w)
    ymins.append(y / bg_h)
    ymaxs.append((y + obj_h) / bg_h)

    #Append the TFRecord label information
    class_label_list.append(class_label)
    class_text_list.append('Mimmy'.encode('utf8'))
    datas[comp_img_path] = class_label

    #Create the TFRecord
    with tf.io.TFRecordWriter(tfr_filename) as writer:
        for data in datas.keys():
            image_string = open(data, 'rb').read()
            tf_example = set_feature(image_string, class_label_list, class_text_list, xmins, xmaxs, ymins, ymaxs)
            writer.write(tf_example.SerializeToString())

if __name__ == "__main__":
    main()

Here is the image created after executing the program. img_comp.jpg

Source code supplement

As noted in the comments in the code, be aware that the dataset_util library included in the TensorFlow Object Detection API is required.
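If you would rather not pull in the whole Object Detection API, these helpers are thin wrappers around tf.train.Feature and can be reproduced in a few lines. A sketch of equivalent definitions, based on my understanding of dataset_util (double-check against the official file if unsure):

import tensorflow as tf

def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def int64_list_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def bytes_list_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def float_list_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))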

This time, for the sake of explanation, I showed one image for compositing and code that generates a single TFRecord file. In practice, though, you will probably need to generate many records from far more images.

In my case, I mass-produced data by randomly selecting object images from a specified folder and making the compositing coordinates somewhat random.
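A minimal sketch of that idea, reusing img_composition from the listing above; the './objects/' folder, the output path, and the count of 100 are hypothetical examples:

import glob
import random
import cv2 as cv

bg = cv.cvtColor(cv.imread('./comp/bg.jpg'), cv.COLOR_RGB2RGBA)
bg_h, bg_w = bg.shape[:2]
obj_paths = glob.glob('./objects/*.png')

for i in range(100):
    obj = cv.imread(random.choice(obj_paths), -1)
    if obj.shape[2] == 3:
        #Add an alpha channel if the PNG lacks one
        obj = cv.cvtColor(obj, cv.COLOR_RGB2RGBA)
    obj_h, obj_w = obj.shape[:2]

    #Random position that keeps the object fully inside the background
    x = random.randint(0, bg_w - obj_w)
    y = random.randint(0, bg_h - obj_h)
    comp = img_composition(bg, obj, x, y)
    cv.imwrite('./comp/img_{:04d}.jpg'.format(i), comp)
    #The normalized box for the TFRecord is then
    #(x / bg_w, y / bg_h, (x + obj_w) / bg_w, (y + obj_h) / bg_h)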

If this article helps you discover tips for mass-producing training data, I would love to hear about them.

Check the contents of the created TFRecord

Finally, just to be safe, let's check the contents of the TFRecord file we just created, using the method introduced in the first half.

features {
  feature {
    key: "image/encoded"
    value {
      bytes_list {
        value: "\377\330\377...
        .....(Since it is a large amount, it is omitted)..."
      }
    }
  }
  feature {
    key: "image/format"
    value {
      bytes_list {
        value: "jpg"
      }
    }
  }
  feature {
    key: "image/height"
    value {
      int64_list {
        value: 1397
      }
    }
  }
  feature {
    key: "image/object/bbox/xmax"
    value {
      float_list {
        value: 0.5151041746139526
      }
    }
  }
  feature {
    key: "image/object/bbox/xmin"
    value {
      float_list {
        value: 0.38489583134651184
      }
    }
  }
  feature {
    key: "image/object/bbox/ymax"
    value {
      float_list {
        value: 0.9878310561180115
      }
    }
  }
  feature {
    key: "image/object/bbox/ymin"
    value {
      float_list {
        value: 0.7916964888572693
      }
    }
  }
  feature {
    key: "image/object/class/label"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "image/object/class/text"
    value {
      bytes_list {
        value: "Mimmy"
      }
    }
  }
  feature {
    key: "image/width"
    value {
      int64_list {
        value: 1920
      }
    }
  }
}

As I said above, you only need to do this once, so check the contents carefully; if the values are as intended, there should be no problem!

In closing

It turned into a long article, but I have summarized, as far as I understand it, the TFRecord-format data used for object detection in TensorFlow.

I hope this article helps expand your options for creating training data. Thank you for reading to the end.
