[PYTHON] [For those who want to use TPU] I tried using the Tensorflow Object Detection API 2

Introduction

This is Ichi Lab from RHEMS Giken. (* Please note that the title TPU is an abbreviation for Tensor Processing Unit and does not mean thermoplastic polyurethane.)

The previous article is here. [For beginners] I tried using the Tensorflow Object Detection API

TensorFlow's Object Detection API (hereafter "the API") is very useful for building object detection AI. On the other hand, I suspect many people run into the following problems.

This time, I would like to leave a memorandum on how I was able to use Cloud TPU at a cost that even an individual can afford. Using the API as described in this article makes Cloud TPU much cheaper than running everything on GCP from start to finish.

We hope that it will be of some help to everyone.

Preface

Rough method

To start with the conclusion, I think the following is the cheapest way to use the API with TPU.

  1. Train with the free tier of Google Colaboratory
  2. If you want to keep training beyond what the free tier allows, continue with a GCP VM and Cloud TPU (both preemptible)

Google Colaboratory has been covered extensively in other articles, so I will omit the details. The price of getting an excellent, high-performance environment for free is that you may be locked out of the GPU or TPU for a while if you overuse it.

In that case, preparing a similar environment yourself costs money, but it saves you the wait until Colaboratory becomes available again.

There is also a nice service called Google Colaboratory Pro for $9.99 per month, but at the time of writing (2020/06) it is available only in the United States. (Another article reports being able to register from Japan, but that may violate the terms of service, so do so at your own risk; I have not tried it.)

Prerequisites

The explanation here is based on the following conditions.

Precautions (story of money)

The method introduced here always uses GCP services, and with either method you will definitely be charged for Cloud Storage usage. New GCP users get a $300 free credit, but it does not cover TPU usage fees, and the Cloud Storage free tier has some restrictions, so be sure to check the terms yourself before proceeding. (Cloud ML has a free tier, but I haven't tried it.)

Common preparation

To train on a TPU, the data and model files must be stored in Cloud Storage.

Here, we will assume the following names.

Project ID: gcp-project-123
Bucket name: my-bucket-123

Don't forget the -m option if you want to copy from your local PC to your bucket quickly!

Example command for sending the contents of the current directory to the bucket (zsh on Mac):


gsutil -m cp -r \* gs://my-bucket-123/

The folder structure in the bucket is as follows. (* The explanation below assumes this layout.)

gs://my-bucket-123/
├── models
│     └── ssd_mobilenet_v1_fpn (pretrained model used as the transfer learning source)
│             └── .ckpt files and others
├── data
│     ├── save (directory where training output is saved)
│     ├── train (training data, tfrecord)
│     └── val (validation data, tfrecord)
├── hoge.config (config file)
└── tf_label_map.pbtxt (label map)

This time, I used ssd_mobilenet_v1_fpn_coco as the transfer learning source. On the Tensorflow detection model zoo page, the trained models that support TPU are marked with a ☆.

I will omit the contents of the config because I covered them in [the previous article](https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b#%E3%82%B3%E3%83%B3%E3%83%95%E3%82%A3%E3%82%B0%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%A8%E9%9B%86). However, now that the above files are in Cloud Storage, the following items need to be updated to match.

fine_tune_checkpoint: "gs://my-bucket-123/models/ssd_mobilenet_v1_fpn/model.ckpt"
label_map_path: "gs://my-bucket-123/tf_label_map.pbtxt"
input_path: "gs://my-bucket-123/data/train/{filename}.tfrecord"
input_path: "gs://my-bucket-123/data/val/{filename}.tfrecord"

(* For how to write the {filename} part, see [the previous article](https://qiita.com/IchiLab/items/fd99bcd92670607f8f9b#%E3%82%B3%E3%83%B3%E3%83%95%E3%82%A3%E3%82%B0%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E7%B7%A8%E9%9B%86).)
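As a rough illustration of the path settings above, the following sketch builds the four values for a given bucket. The function name `gcs_config_paths`, the dictionary keys `train_input_path` / `val_input_path`, and the tfrecord file names are hypothetical and exist only for this example (in the real config both entries are written as `input_path`); the bucket layout is the one assumed in this article.

```python
def gcs_config_paths(bucket, model_dir, train_file, val_file):
    """Return config path values for the bucket layout assumed in this article."""
    root = "gs://" + bucket
    return {
        "fine_tune_checkpoint": f"{root}/models/{model_dir}/model.ckpt",
        "label_map_path": f"{root}/tf_label_map.pbtxt",
        # Hypothetical keys: both of these are `input_path` in the actual config.
        "train_input_path": f"{root}/data/train/{train_file}",
        "val_input_path": f"{root}/data/val/{val_file}",
    }


# The tfrecord file names below are placeholders, not files from the article.
paths = gcs_config_paths(
    "my-bucket-123", "ssd_mobilenet_v1_fpn", "train.tfrecord", "val.tfrecord"
)
for key, value in paths.items():
    print(f'{key}: "{value}"')
```

Printing the values this way makes it easy to compare them against the config by eye before uploading it.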

1. How to do it using Google Colaboratory

If the free tier is enough for you, use this method. As mentioned earlier, however, Cloud Storage charges will still apply.

1-1. Put the API source on the drive

You can do this on your local machine or in a container: run git clone, then store the result in your Google Drive.

By the way, when I tried the latest master, I ran into many problems, such as things that had previously worked no longer working, so I recommend the following branch.

git clone -b tf_2_1_reference https://github.com/tensorflow/models.git

Don't forget the coco API.

git clone --depth 1 https://github.com/cocodataset/cocoapi.git

Here, it is assumed that the source code is placed in the following directory.

/content/drive/My Drive/models/research
/content/drive/My Drive/cocoapi/PythonAPI

1-2. Make a new notebook

In your browser, open Google Drive and select New > More > Google Colaboratory.

Change the title from "Untitled0.ipynb" to any name you like (recommended).

From the menu at the top, select "Runtime" -> "Change runtime type", set Hardware accelerator to "TPU", and click "Save". Then select Connect.

1-3. Mount Google Drive

First of all, nothing can start unless the API source can be read, so mount Google Drive.

from google.colab import drive
drive.mount('/content/drive')

1-4. GCP project settings

To link with Cloud Storage, set the project with the gcloud command.

from google.colab import auth
auth.authenticate_user()
project_id = 'gcp-project-123'
!gcloud config set project {project_id}
!gsutil ls gs://my-bucket-123

Authentication is similar to Google Drive. If successful, you can check the contents of the bucket with the ls command.

1-5. Installation of cocoAPI (first time only)

%cd /content/drive/My\ Drive/cocoapi/PythonAPI
!make
!cp -r pycocotools /content/drive/My\ Drive/models/research/

1-6. Execution of protoc (first time only)

Convert .proto to .py.

%cd /content/drive/My\ Drive/models/research
!protoc object_detection/protos/*.proto --python_out=.

1-7. Change the version of tensorflow

At the time of writing, the Tensorflow Object Detection API does not support tensorflow 2.X. Google Colaboratory, on the other hand, comes with the 2.X series preinstalled. Therefore, you need to check the version and reinstall.

!pip list | grep tensor
!pip install tensorflow==1.15.0rc3
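As a small sanity check after the reinstall, something like the following sketch can confirm that the installed version is in the 1.x series. The helper name `is_tf1` is hypothetical; in a notebook you would pass it `tf.__version__` after restarting the runtime.

```python
def is_tf1(version):
    """Return True when a TensorFlow version string belongs to the 1.x series."""
    major = version.split(".")[0]
    return major == "1"


# Sample version strings; in Colab, use tf.__version__ instead.
for v in ["1.15.0rc3", "2.2.0"]:
    verdict = "usable with this branch" if is_tf1(v) else "reinstall 1.15"
    print(v, "->", verdict)
```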

1-8. Setting environment variables

%env PYTHONPATH=/env/python:/content/drive/My Drive/models/research:/content/drive/My Drive/models/research/slim
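The same setting can also be sketched in plain Python, assuming that commands started with `!` inherit the notebook's environment. The helper name `build_pythonpath` is hypothetical; the directories are the ones used above.

```python
import os


def build_pythonpath(research_dir, existing=""):
    """Append the research and slim directories to an existing PYTHONPATH value."""
    parts = [p for p in existing.split(os.pathsep) if p]
    parts += [research_dir, os.path.join(research_dir, "slim")]
    return os.pathsep.join(parts)


research = "/content/drive/My Drive/models/research"
os.environ["PYTHONPATH"] = build_pythonpath(research, "/env/python")
print(os.environ["PYTHONPATH"])
```

Building the value from a list avoids typing the long path twice, which is where typos tend to creep in.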

1-9. Executing API test code

Let's test whether the environment has been built successfully. If all goes well, you will see "OK" on multiple lines. By the way, this article uses the source of a slightly older branch; in recent versions the test has been renamed to model_builder_tf1_test.py.

%cd /content/drive/My Drive/models/research
!python object_detection/builders/model_builder_test.py

1-10. Start Tensorboard (not required)

If you specify the directory where the training output is saved and start Tensorboard as shown below, you can check how the loss changes and the number of training steps per second.

%load_ext tensorboard
%tensorboard --logdir gs://my-bucket-123/data/save

1-11. Start training

Use model_tpu_main.py instead of model_main.py for training. You can specify the GCP project ID and TPU name as options, but this was not necessary in the Google Colaboratory environment, probably because the TPU address is already registered in an environment variable (this is a guess). (If you check with %env, a TPU address beginning with grpc:// is registered under the name TPU_NAME.)

%cd /content/drive/My Drive/models/research
pipeline = 'gs://my-bucket-123/hoge.config'
save = 'gs://my-bucket-123/data/save'
train_step = 1000
mode = 'train'
batch_size = 64

!python object_detection/model_tpu_main.py \
 --pipeline_config_path={pipeline} \
 --mode={mode} \
 --num_train_steps={train_step} \
 --eval_training_data=True \
 --train_batch_size={batch_size} \
 --model_dir={save} \
 --alsologtostderr
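To check the guess about the environment variable before starting a long run, a sketch like the following reads TPU_NAME and verifies that it looks like a grpc:// address. The helper name `tpu_address` is hypothetical, and the address used in the usage note is a made-up example.

```python
import os


def tpu_address(env=None):
    """Return the grpc:// address stored in TPU_NAME, or None if it is absent."""
    env = os.environ if env is None else env
    name = env.get("TPU_NAME", "")
    return name if name.startswith("grpc://") else None


print(tpu_address() or "TPU_NAME has no grpc:// address; check the runtime type")
```

For example, `tpu_address({"TPU_NAME": "grpc://10.0.0.2:8470"})` returns the address unchanged, while an empty mapping returns None.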

1-12. Bonus (convenient to do)

Google Colaboratory has a usage limit of less than 12 hours, but it does not show anywhere how many hours you actually have left. You can find out by running the code below.

import time, psutil
elapsed = time.time() - psutil.boot_time()  # seconds since the runtime booted
remaining = 12 * 3600 - elapsed             # seconds left of the 12-hour limit
print('remaining time (hours): ', remaining / 3600)

Once training starts, other cells wait until it finishes, so run this beforehand if you want to know.

2. How to launch your own GCP VM and Cloud TPU

By the way, if Google Colaboratory says "I can't use it for a while, please wait" and you can't wait, try this method.

2-1. Launch VM and Cloud TPU

First, you need to have Compute Engine and Cloud TPU enabled. The first time, follow the steps below (as of the time of writing).

For Compute Engine: open "Navigation menu" -> "Compute Engine" -> "VM instances" in the upper left. Preparation starts automatically.

For Cloud TPU: open "Navigation menu" -> "Compute Engine" -> "TPU" in the upper left. The first time, you need to select "Enable API". (Rest assured that TPU billing does not start with this alone.)

Once everything is enabled, open Cloud Shell using its icon in the upper right corner. After waiting a moment for it to open, start the VM and TPU at the same time with the ctpu command.

ctpu up --zone=us-central1-b --tf-version=1.15 --machine-type=n1-standard-4 --name=mytpu --preemptible --preemptible-vm

The important point here is to pass the preemptible options for both the VM and the TPU, that is, to use preemptible instances.

The following table shows the prices calculated by the official pricing tool for a TPU v2 located in us-central1.

TPU class    Regular          Preemptible
Per hour     About 485 yen    About 146 yen

For details on preemptible instances, please refer to the official documentation.
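To make the difference concrete, here is a small worked calculation using the approximate hourly prices from the table above. The function name `tpu_cost_yen` and the 6-hour duration are assumptions for illustration only; actual charges depend on current pricing and exchange rates.

```python
REGULAR_YEN_PER_HOUR = 485      # approximate figure from the table above
PREEMPTIBLE_YEN_PER_HOUR = 146  # approximate figure from the table above


def tpu_cost_yen(hours, preemptible=True):
    """Estimate the TPU charge in yen for the given number of hours."""
    rate = PREEMPTIBLE_YEN_PER_HOUR if preemptible else REGULAR_YEN_PER_HOUR
    return hours * rate


hours = 6  # hypothetical training time
regular = tpu_cost_yen(hours, preemptible=False)
cheap = tpu_cost_yen(hours)
print(f"regular: {regular} yen, preemptible: {cheap} yen, saved: {regular - cheap} yen")
```

For 6 hours this works out to 2910 yen versus 876 yen, a difference of 2034 yen.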

You can do the same from the console or the gcloud command. Details can be found in the official documentation, Creating and Deleting TPUs (https://cloud.google.com/tpu/docs/creating-deleting-tpus?hl=ja).

When you execute the command, a confirmation will be displayed as shown below.

  Name:                 mytpu
  Zone:                 us-central1-b
  GCP Project:          gcp-project-123
  TensorFlow Version:   1.15
  VM:
      Machine Type:     n1-standard-4
      Disk Size:        250 GB
      Preemptible:      true
  Cloud TPU:
      Size:             v2-8
      Preemptible:      true
      Reserved:         false
OK to create your Cloud TPU resources with the above configuration? [Yn]:

Type y and press Enter / return to start creating each resource. The only reason I chose n1-standard-4 as the machine type is that it is close to the memory of the Google Colaboratory environment, so change it if necessary.

By the way, if you have accidentally deleted the default Compute Engine service account, you will not be able to create the resources with the above ctpu command. It fails with an error like this:

2020/06/20 00:00:00 Creating Compute Engine VM mytpu (this may take a minute)...
2020/06/20 00:00:07 TPU operation still running...
2020/06/20 00:00:07 error retrieving Compute Engine zone operation: 

(When did I delete it...?) I didn't know the solution, so I created a new project instead.

Now, back to the main topic.

2-2. Stop the TPU that has just started

Cloud TPU is billed by the second. Once you have confirmed that it started up successfully, stop it for the time being.

2-3. SSH into the created VM instance

Once the instance has started successfully, connect to it with the "SSH" button on the VM instances page.

When the connection is complete, a console window opens. From here on, we will work in this console.

You can also check the TPU status with the gcloud command here.


gcloud config set compute/zone us-central1-b
Updated property [compute/zone].

gcloud compute tpus list
NAME   ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINTS  NETWORK  RANGE          STATUS
mytpu  us-central1-b  v2-8              10.240.1.2:8470    default  10.240.1.0/29  STOPPING

As an aside, the TPU status is displayed as one of the following.

Status      Meaning
CREATING    Being created
STARTING    Starting up
READY       Running
STOPPING    Stopping
STOPPED     Stopped
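If you want to read the status from a script rather than by eye, the listing can be parsed with a few lines of Python. This is only a sketch: the helper name `tpu_status` is hypothetical, the sample text is the output shown above, and the whitespace-based parsing assumes that no column is empty.

```python
SAMPLE = """\
NAME   ZONE           ACCELERATOR_TYPE  NETWORK_ENDPOINTS  NETWORK  RANGE          STATUS
mytpu  us-central1-b  v2-8              10.240.1.2:8470    default  10.240.1.0/29  STOPPING
"""


def tpu_status(listing, name):
    """Return the STATUS column for the named TPU, or None if it is not listed."""
    lines = listing.strip().splitlines()
    if not lines:
        return None
    header = lines[0].split()
    for line in lines[1:]:
        # Assumes every column has a value, so splitting on whitespace lines up.
        row = dict(zip(header, line.split()))
        if row.get("NAME") == name:
            return row.get("STATUS")
    return None


print(tpu_status(SAMPLE, "mytpu"))
```

In practice the listing text would come from running `gcloud compute tpus list` and capturing its output.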

2-4. Set alias (not required)

I added this item because I want to clearly unify whether it is python or python3. I always want to use 3.X for python, so change the settings as follows.

Open .bashrc

vi ~/.bashrc

Add settings to the last line

alias python="python3" 
alias pip='pip3'

Reflect settings

source ~/.bashrc

The python command now runs python3.

2-5. Installation of required libraries

From here on, the steps are almost the same as the usual API environment setup, but I will describe them in full.

sudo apt-get update
sudo apt-get install -y protobuf-compiler python-pil python-lxml python-tk
pip install -U pip && pip install Cython contextlib2 jupyter matplotlib tf_slim pillow

Next, fetch the API source code and the coco API.

git clone -b tf_2_1_reference https://github.com/tensorflow/models.git
git clone --depth 1 https://github.com/cocodataset/cocoapi.git

Again, the source code for the API used in this article is the branch above.

2-6. Installation of coco API

Then install the coco API.

For me, make failed as follows.

x86_64-linux-gnu-gcc: error: pycocotools/_mask.c: No such file or directory

To avoid this, modify the Makefile a bit.

cd cocoapi/PythonAPI
vi Makefile

After opening the Makefile, change the python part to python3 (there are two places).

make
cp -r pycocotools /home/ichilab/models/research && cd ../../ && rm -rf cocoapi
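The Makefile change described above (python to python3 in two places) can also be scripted instead of edited by hand in vi. The sketch below runs on a sample text that is only modeled on the cocoapi Makefile, so treat its contents as an assumption; the helper name `use_python3` is hypothetical.

```python
import re

# Sample text modeled on the cocoapi Makefile (an assumption, not a verbatim copy).
SAMPLE_MAKEFILE = """\
all:
\tpython setup.py build_ext --inplace
\trm -rf build

install:
\tpython setup.py build_ext install
\trm -rf build
"""


def use_python3(makefile_text):
    """Replace the bare `python` command with `python3`; `python3` is left as is."""
    return re.sub(r"\bpython\b", "python3", makefile_text)


fixed = use_python3(SAMPLE_MAKEFILE)
print(fixed)
```

Because the pattern requires a word boundary after `python`, running the function a second time leaves the text unchanged.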

2-7. Executing protoc

Convert .proto to .py.

cd models/research
protoc object_detection/protos/*.proto --python_out=.

2-8. Setting environment variables

Once I closed the SSH screen, I had to do this part again.

# Run this from the models/research directory
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
source ~/.bashrc

2-9. Executing API test code

Let's test whether the environment has been built successfully. If successful, "OK" will be displayed over multiple lines.

python object_detection/builders/model_builder_test.py

It may be good to make sure that you can see the contents of the bucket.

gsutil ls gs://my-bucket-123

2-10. Resume the stopped TPU

You can't train while the TPU is stopped, so start it again here. Once you have confirmed that it is running, move on to the next step.

2-11. Start training

Start the training.

python object_detection/model_tpu_main.py \
--tpu_name=mytpu \
--model_dir=gs://my-bucket-123/data/save \
--mode=train \
--pipeline_config_path=gs://my-bucket-123/hoge.config \
--alsologtostderr

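A brief description of the options:

--tpu_name: name of the Cloud TPU (the name given to ctpu up, here mytpu)
--model_dir: directory where the training output is saved
--mode: execution mode (here train)
--pipeline_config_path: path to the config file
--alsologtostderr: also write logs to standard error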
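Long multi-line commands are easy to break with a stray space after a backslash, so one option is to assemble the arguments in Python. This is a sketch under stated assumptions: the helper name `build_train_command` is hypothetical, the flag values are the ones used in this article, and the logging flag uses the standard absl spelling `--alsologtostderr`.

```python
import shlex


def build_train_command(tpu_name, model_dir, pipeline_config_path, mode="train"):
    """Assemble the model_tpu_main.py command line as a list of arguments."""
    return [
        "python", "object_detection/model_tpu_main.py",
        f"--tpu_name={tpu_name}",
        f"--model_dir={model_dir}",
        f"--mode={mode}",
        f"--pipeline_config_path={pipeline_config_path}",
        "--alsologtostderr",
    ]


cmd = build_train_command(
    "mytpu", "gs://my-bucket-123/data/save", "gs://my-bucket-123/hoge.config"
)
# Print a shell-safe, single-line version for review before running it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)` from the models/research directory.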

By the way, when I used the latest source here, I was stuck for quite a while on the following error.

tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:tpu_worker/replica:0/task:0:

This is the only reason I'm using the source code of the branch I mentioned earlier. The config and other files were exactly the same in both cases, so the cause is unknown at this time.

2-12. Stop / delete the TPU and VM after training

When you're done, stop and delete the TPU first.

How much will it cost

Now that the environment setup has been explained, you are probably wondering how much it actually costs.

I'm sorry that I can't offer a proper comparison, but training 100,000 steps with Google Colaboratory cost less than 400 yen. When I started up a VM and TPU in my own project, I never ran as many as 100,000 steps. For the TPU, please use the prices mentioned above as a guide; Compute Engine cost 4 yen for about 6 hours of use and the external IP cost 1 yen, which came to less than 10 yen in total.

For this kind of estimate, the official pricing calculator will get you closer to the correct answer than my article.

In conclusion

What did you think?

Surprisingly, I couldn't find any article summarizing the environment setup for Cloud TPU × Tensorflow Object Detection API, so I hope this article leads more people to train with TPU and to take an interest in GCP.

I sincerely hope that faster training will accelerate your research on object detection AI.
