Detect Japanese characters from images using Google's Cloud Vision API in Python

Introduction

I have been writing a series of hands-on articles to review and consolidate what I learned while developing the serverless web application Mosaic. I plan 17 articles in total; 13 are written and 4 remain, but I'm getting a little tired of it. Or rather, I want to start implementing new features. So I did: this article is about adding a character detection function.

Character detection? Character recognition? Either way, it is what is usually called OCR (Optical Character Recognition). AWS Rekognition can apparently detect text too, but unfortunately it does not seem to support Japanese. (As of January 2020. I haven't tried it myself, but it still appears to be unsupported.) Google's Cloud Vision API does support Japanese, so I decided to use it.

Enabling the Cloud Vision API

So let's get started right away. Open "APIs and services" in the Google Cloud Platform console or the Google Developers console. https://console.cloud.google.com/apis https://console.developers.google.com/apis

Press the "+ Enable APIs and Services" button at the top of the screen.

Find the Cloud Vision API and enable it.

A service account has to be added as the credential, but the steps are omitted here because a separate article already covers them.

Use Google Vision API with Lambda (Python3)

For how to import the libraries required to call Google's APIs with a service account, please refer to that separate article as well.

The code that passes the local image file to the Vision API and detects faces and text looks like this:

`lambda_function.py`


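from base64 import b64encode
import logging

from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

logger = logging.getLogger()
  : 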
def detectFacesByGoogleVisionAPIFromF(localFilePath, bucket, dirPathOut):
    try:
        keyFile = "service-account-key.json"
        scope = ["https://www.googleapis.com/auth/cloud-vision"]
        api_name = "vision"
        api_version = "v1"
        service = getGoogleService(keyFile, scope, api_name, api_version)
        
        ctxt = None
        with open(localFilePath, 'rb') as f:
            ctxt = b64encode(f.read()).decode()
        
        service_request = service.images().annotate(body={
            "requests": [{
                "image":{
                    "content": ctxt
                  },
                "features": [
                    {
                        "type": "FACE_DETECTION"
                    }, 
                    {
                        "type": "TEXT_DETECTION"
                    }
                ]
            }]
        })
        response = service_request.execute()
        
    except Exception as e:
        logger.exception(e)

def getGoogleService(keyFile, scope, api_name, api_version):
    credentials = ServiceAccountCredentials.from_json_keyfile_name(keyFile, scopes=scope)
    return build(api_name, api_version, credentials=credentials, cache_discovery=False)

This sample code specifies FACE_DETECTION and TEXT_DETECTION. You can also specify LABEL_DETECTION, LANDMARK_DETECTION, LOGO_DETECTION, and so on, but every feature you specify adds to the charge. If character detection is all you need, it is better to specify only TEXT_DETECTION.

I won't go into the details of what can be specified for features or of the JSON that is returned. See the Cloud Vision API docs (https://cloud.google.com/vision/docs?hl=ja). Pricing details are here (https://cloud.google.com/vision/pricing?hl=ja).
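The advice to request only TEXT_DETECTION can be sketched as a small helper that builds the request body. This is a minimal sketch, not the author's code: the function name `build_annotate_body` and its `feature_types` parameter are made up for illustration, and the body layout simply mirrors the sample above.

```python
from base64 import b64encode


def build_annotate_body(image_bytes, feature_types=("TEXT_DETECTION",)):
    """Build the body for service.images().annotate() from raw image bytes.

    build_annotate_body and feature_types are hypothetical names.
    The default asks for TEXT_DETECTION only, so no other feature is billed.
    """
    return {
        "requests": [{
            # The API expects the image as a base64-encoded string.
            "image": {"content": b64encode(image_bytes).decode()},
            "features": [{"type": t} for t in feature_types],
        }]
    }


# Dummy bytes stand in for the contents of a real image file.
body = build_annotate_body(b"dummy image bytes")
print([f["type"] for f in body["requests"][0]["features"]])
```

The returned dictionary can be passed as `body=` to `service.images().annotate(...)` in the same way as in the sample code; face detection can be added back by passing `("FACE_DETECTION", "TEXT_DETECTION")`.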

I really wanted to run Vision on files uploaded to Google Drive

But I couldn't.

`test.py`


        imageUri = "https://drive.google.com/uc?id=" + fileID
        service_request = service.images().annotate(body={
            "requests": [{
                "image":{
                    "source":{
                      "imageUri": imageUri
                    }
                  },
                "features": [
                    {
                        "type": "FACE_DETECTION"
                    }, 
                    {
                        "type": "TEXT_DETECTION"
                    }
                ]
            }]
        })
        response = service_request.execute()

I thought this would be a neat way to do it, but I got the following errors. I tried various things, but in the end I couldn't get it to work.

`response.json`


{"responses": [{"error": {"code": 7, "message": "We're not allowed to access the URL on your behalf. Please download the content and pass it in."}}]}
{"responses": [{"error": {"code": 4, "message": "We can not access the URL currently. Please download the content and pass it in."}}]}

The call to Vision itself succeeds, but Vision apparently cannot access the image on the web given in imageUri. It failed even when I specified a direct link to an image in a public S3 bucket. It's a mystery. If you know why, please let me know.
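Both error messages end with "Please download the content and pass it in", so one workaround is to fetch the image yourself and send it as base64 content, as in the local-file sample. The following is only a sketch under assumptions: `download`, `build_body_from_url`, and `collect_errors` are hypothetical names, and `download` assumes the URL is readable without authentication (a private Google Drive file would have to be fetched with credentials instead).

```python
from base64 import b64encode
from urllib.request import urlopen


def download(url, timeout=10):
    """Fetch the raw bytes behind a URL (assumes it is publicly readable)."""
    with urlopen(url, timeout=timeout) as res:
        return res.read()


def build_body_from_url(url, fetch=download):
    """Download the image ourselves and embed it as base64 'content'.

    fetch is injectable so another downloader (or a test stub) can be used.
    """
    return {
        "requests": [{
            "image": {"content": b64encode(fetch(url)).decode()},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }


def collect_errors(response):
    """Return (index, code, message) for every per-image error in a response."""
    return [
        (i, r["error"].get("code"), r["error"].get("message"))
        for i, r in enumerate(response.get("responses", []))
        if "error" in r
    ]
```

As the responses above show, these errors arrive inside the response body rather than as a Python exception, so checking `collect_errors(response)` after `execute()` is what makes a failure like this visible.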

Operation check

TEXT_DETECTION gives you two elements: textAnnotations and fullTextAnnotation. The result of enclosing the area detected by textAnnotations in blue and the area detected by fullTextAnnotation in green was as follows.

I think the results are reasonable, but in the blue boxes the extent of each character seems slightly off, probably because the text is Japanese.

Also, detection seems to be weak when each character stands alone, as in the image below. I haven't investigated in depth whether this is also specific to Japanese, whether a different feature would detect the characters, or whether other parameters could tune it, but either way I did not get the result I expected. All I can do is hope that the API's training progresses and that it will someday detect every character in this image. If AWS Rekognition ever supports Japanese text detection, this image is the first thing I'll evaluate it with!
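To draw boxes like these, the polygons have to be pulled out of both elements of one entry in `responses`. The sketch below is one way to do that, not the author's code: the helper names and the tiny `sample_result` are made up, and because the article does not say which level of fullTextAnnotation was drawn in green, this version collects the symbol (single character) level.

```python
def to_points(poly):
    """Convert a boundingPoly / boundingBox dict to [(x, y), ...].

    The API can omit a coordinate whose value is 0, hence the defaults.
    """
    return [(v.get("x", 0), v.get("y", 0)) for v in poly.get("vertices", [])]


def text_annotation_boxes(result):
    """(text, polygon) per textAnnotations entry; entry 0 is the whole text."""
    return [(a.get("description", ""), to_points(a.get("boundingPoly", {})))
            for a in result.get("textAnnotations", [])]


def symbol_boxes(result):
    """(character, polygon) for every symbol in fullTextAnnotation."""
    boxes = []
    for page in result.get("fullTextAnnotation", {}).get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    for symbol in word.get("symbols", []):
                        boxes.append((symbol.get("text", ""),
                                      to_points(symbol.get("boundingBox", {}))))
    return boxes


# Hypothetical, hand-written result for illustration only.
whole = {"vertices": [{}, {"x": 20}, {"x": 20, "y": 10}, {"y": 10}]}
left = {"vertices": [{}, {"x": 10}, {"x": 10, "y": 10}, {"y": 10}]}
right = {"vertices": [{"x": 10}, {"x": 20}, {"x": 20, "y": 10}, {"x": 10, "y": 10}]}
sample_result = {
    "textAnnotations": [
        {"description": "日本", "boundingPoly": whole},
        {"description": "日本", "boundingPoly": whole},
    ],
    "fullTextAnnotation": {"pages": [{"blocks": [{"paragraphs": [{"words": [
        {"symbols": [{"text": "日", "boundingBox": left},
                     {"text": "本", "boundingBox": right}]}
    ]}]}]}]},
}

blue = text_annotation_boxes(sample_result)[1:]  # skip the whole-text entry
green = symbol_boxes(sample_result)
```

Each polygon is a list of corner points that an imaging library's polygon-drawing function can outline in blue or green on top of the original image.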

Afterword

I know that writing things up as articles is important and necessary, but creating something new is the most fun. Riding that momentum, I was able to write this Vision API article smoothly. Writing an article later is a hassle, so it may be important to write it quickly, while the memory is fresh and the excitement is still high.

Mosaic, the serverless web application I'm building now, basically runs on AWS infrastructure, but eventually, by the end of 2020, I'd like to build the same thing on GCP. I think infrastructure is better built in a unified way, but for web API services such as Rekognition and the Vision API, either is fine; I'll pick whichever one can do what I want. Both are just API calls, so it's no big deal. This may be true of any system, but it strengthens the feeling of building a plastic model by combining parts. I don't think you need to know everything, but I don't think it's good to stick too closely to just one thing either.

It's great that high-accuracy machine-learning services are so easy to use, but it's painful to get stuck when they don't return the results you expect. Beyond that point it is research territory where there is nothing I can do myself, so I can only sit and wait for the day the results I expect come out.