Strings uploaded to GCS with Python are garbled when viewed in a browser

tl;dr

- GCS lets you specify the Content-Type of an object
- Chrome tries to display text/plain as Shift_JIS
- text/plain; charset=utf-8 is the friendlier choice

Phenomenon

Create an object in Google Cloud Storage with a script like the following. It is the Example Usage code from googleapis/python-storage, modified to store a string containing Japanese. Bucket creation and authentication are not relevant here, so I will skip them.

from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# bucket.blob() builds a local handle; get_blob() returns None if the object does not exist yet
blob = bucket.blob('remote/path/to/file.json')
blob.upload_from_string('{"name": "日本語"}')

After running the script, I checked in the browser whether the object had been created.

[Screenshot 2020-04-25 1.05.50.png]

The object seems to have been created successfully. Let's take a look at its contents.

[Screenshot 2020-04-25 2.23.36.png]

Google Cloud Storage can generate temporary links, and you can download an object by following such a link from your browser.

[Screenshot 2020-04-25 1.07.30.png]

The contents of the object come out garbled, as shown here. This is the problem I look into in this post.

file.json


{"name": "譌 ・ 譛 ャ 隱."}

Investigation

Forgotten-encode hypothesis

I often forget to encode and decode strings. `upload_from_string` is passed a str here, so I checked whether I should have encoded it to UTF-8 first.

Looking at the code, what `upload_from_string` does is simple.

Excerpt


def upload_from_string(self, data, ...):  # remaining parameters omitted
    data = _to_bytes(data, encoding="utf-8")
    string_buffer = BytesIO(data)
    self.upload_from_file(...)  # arguments omitted

  1. Convert the data to a UTF-8 byte string
  2. Wrap that byte string so it can be treated like a file
  3. Call `upload_from_file`

From the above, the string encoding seems to be fine.
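The three steps can be mirrored with the standard library alone. This is a minimal sketch, not the library's code: `to_upload_buffer` is a hypothetical helper, `str.encode` stands in for the library's internal `_to_bytes`, and the sample string is an assumed stand-in for the uploaded JSON.

```python
from io import BytesIO


def to_upload_buffer(text: str) -> BytesIO:
    """Hypothetical helper: encode text as UTF-8 and wrap it in a file-like object."""
    data = text.encode("utf-8")  # step 1: UTF-8 byte string
    return BytesIO(data)         # step 2: treat the bytes like a file


# Step 3 in the library would hand this buffer to upload_from_file;
# here we only read it back to confirm the bytes are valid UTF-8.
buffer = to_upload_buffer('{"name": "日本語"}')
raw = buffer.read()
decoded = raw.decode("utf-8")
print(len(raw), decoded)
```

Decoding the buffer as UTF-8 returns the original string, which is consistent with the conclusion that the bytes stored in GCS are not the problem.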

Content-Type hypothesis

Incidentally, while looking at the object details in the browser, I noticed something that caught my attention.

[Screenshot 2020-04-25 2.23.36.png]

type="text/plain"

GCS attaches metadata to each object. The Content-Type metadata apparently determines the Content-Type response header returned when the object is requested. https://cloud.google.com/storage/docs/metadata#content-type

By default it should be `application/octet-stream` or `application/x-www-form-urlencoded`, but here it is text/plain. Is this the cause?

Experiment

I hypothesized that Content-Type: text/plain causes the garbling, so to isolate the issue from GCS I set up a local server and checked how the browser displays the response. The server, written with bottle, simply returns a string with Content-Type: text/plain.

server.py


from bottle import Bottle, HTTPResponse
import os

app = Bottle()

@app.route('/')
def serve():
    # The body must contain non-ASCII text, otherwise nothing can be garbled
    r = HTTPResponse(status=200, body='ほげ')
    r.set_header('Content-Type', 'text/plain')
    return r

if __name__ == '__main__':
    port = int(os.environ.get('PORT', '3000'))
    app.run(host='0.0.0.0', port=port)

Open it in the browser.

[Screenshot 2020-04-25 2.28.16.png]

The garbling is reproduced.

Next, change the Content-Type to `application/json` and open it in the browser again.

[Screenshot 2020-04-25 2.34.56.png]

With `application/json`, the text is displayed correctly.

From the above, it seems fair to conclude that Content-Type: text/plain causes the garbling, regardless of GCS.
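The same comparison can be sketched with the standard library only, to confirm that the response bytes are identical and that only the header differs. This is an illustration rather than the bottle server used in the experiment: the three paths, the `Handler` class, and the `fetch` helper are hypothetical names, and the body is an assumed sample string.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = '{"name": "日本語"}'.encode("utf-8")

# Hypothetical paths, each mapped to the Content-Type value it returns.
CONTENT_TYPES = {
    "/plain": "text/plain",
    "/plain-utf8": "text/plain; charset=utf-8",
    "/json": "application/json",
}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        content_type = CONTENT_TYPES.get(self.path)
        if content_type is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)  # the same bytes for every path

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def fetch(port, path):
    """Return (Content-Type header, body bytes) for one GET request."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        res = conn.getresponse()
        return res.getheader("Content-Type"), res.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 lets the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
results = {path: fetch(port, path) for path in CONTENT_TYPES}
server.shutdown()
server.server_close()

for path, (content_type, body) in results.items():
    print(path, content_type, body == BODY)
```

Since the body bytes match for every path, any difference in how a browser renders them would have to come from the Content-Type header alone.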

Why Content-Type: text/plain

The question remains why the Content-Type, which should have defaulted to `application/octet-stream` or `application/x-www-form-urlencoded` in GCS, ended up as text/plain. The culprit is Blob.upload_from_string (source lines L1650-L1660) in the google.cloud.storage module used for the upload.

Excerpt


    def upload_from_string(
        self,
        data,
        content_type="text/plain",
        client=None,
        predefined_acl=None,
        if_generation_match=None,
        if_generation_not_match=None,
        if_metageneration_match=None,
        if_metageneration_not_match=None,
    ):

Since the default value of the content_type argument is text/plain, the object is stored with Content-Type: text/plain unless you specify otherwise.

Browser behavior

In the past, it seems the viewer could switch the character encoding manually, but now it seems that only the browser's automatic detection is available.

> document.characterSet
"Shift_JIS"

Chrome is trying to display the page as Shift_JIS.
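The effect of a wrong guess can be reproduced without a browser by decoding the same UTF-8 bytes with the wrong codec. This is a minimal sketch, assuming Python's built-in shift_jis codec is close enough to Chrome's Shift_JIS decoder for illustration; the sample string is an assumed stand-in for the uploaded JSON, and `errors="replace"` is used so that byte sequences invalid in Shift_JIS do not raise an exception.

```python
text = '{"name": "日本語"}'
raw = text.encode("utf-8")  # the bytes that are actually stored and served

as_utf8 = raw.decode("utf-8")                       # correct interpretation
as_sjis = raw.decode("shift_jis", errors="replace")  # what a Shift_JIS guess produces

print(as_utf8)
print(as_sjis)
```

The ASCII part of the JSON survives either way, because ASCII bytes mean the same thing in both encodings; only the multi-byte Japanese characters turn into unrelated characters.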

Scope of the problem

Up to this point, we have found that the garbling has the following two causes.

- The response header when GCS returns the object is Content-Type: text/plain
- Chrome tries to display Content-Type: text/plain as Shift_JIS, which garbles the text

The characters are garbled when viewed in a browser, but there is no problem when the object is processed by a program, as follows.

from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
blob = bucket.get_blob('remote/path/to/file.json')
# download_as_string() returns bytes, so decode them as UTF-8 before printing
print(blob.download_as_string().decode('utf-8'))

In my case the saved objects are read by a program anyway, so this only matters when I want a quick look at the contents. It could be a real problem if you use GCS for static file hosting.

Solution

When using upload_from_string, specify content_type explicitly. For JSON, set it to `application/json`; for plain text, include the charset, as in `text/plain; charset=utf-8`, so that Chrome reads it as UTF-8.
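As a rough sketch of that advice, the two helpers below pass content_type explicitly. `upload_json` and `upload_text` are hypothetical names, and `bucket` is assumed to be a google.cloud.storage Bucket such as the one returned by `client.get_bucket(...)`; only `bucket.blob` and the `content_type` argument of `upload_from_string` come from the library.

```python
import json


def upload_json(bucket, path, payload):
    """Hypothetical helper: store payload as JSON with an explicit Content-Type."""
    blob = bucket.blob(path)
    # ensure_ascii=False keeps Japanese text readable instead of \uXXXX escapes
    body = json.dumps(payload, ensure_ascii=False)
    blob.upload_from_string(body, content_type="application/json")
    return blob


def upload_text(bucket, path, text):
    """Hypothetical helper: store plain text and declare its charset."""
    blob = bucket.blob(path)
    blob.upload_from_string(text, content_type="text/plain; charset=utf-8")
    return blob
```

With a real bucket, something like `upload_json(bucket, 'remote/path/to/file.json', {"name": "日本語"})` should produce an object whose Content-Type metadata is application/json rather than the text/plain default.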

Summary

I had let my guard down because garbled text is rare these days. There is not much of a general lesson here, other than to be careful when using Blob.upload_from_string.
