[PYTHON] Data handling 3 (development) About data format

Aidemy 2020/10/

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that so many people read the previous summary article. Thank you! This is the third post on data handling. Thanks in advance for reading.

What to learn this time

- About Protocol Buffers
- About hdf5
- About TFRecord

About Protocol Buffers (Development)

What are Protocol Buffers?

Protocol Buffers are used by Google to store data and exchange all kinds of structured information. (Quoted from Wikipedia, "Protocol Buffers": https://ja.wikipedia.org/wiki/Protocol_Buffers)

- To handle data with Protocol Buffers, you first define a Message Type in advance.
- A Message Type is like a class and is written in the proto2 language.

Message Type definition

- First, let's see how to write one, using the source code of a Message Type that describes a family's structure as an example.

- Code ![Screenshot 2020-10-28 22.39.00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/c16c44d0-1883-12cd-a3b4-81fd090d4650.png)

- Declare the use of proto2 with __syntax = "proto2";__. Be sure to end each statement with ";".
- "message Person {}" defines a class (message type) called "Person".
- Comments are written with __"//"__ for a single line and __"/* */"__ for multiple lines.
- In "required string name = 1;", __"string name"__ indicates that "name" is a string (str in Python). The type and the name together are called a field. The __"= 1"__ part is called the __tag__ and is used to identify the field when the data is output. __"required"__ must be added to fields that are mandatory.
- Similarly, "required int32 age = 2;" means that "age" is an integer (int32) and its tag is 2.

- "enum Relationship {}" defines a new type, in the same way that string and int32 are types. Here it is the "Relationship" type.
- In an enum, each value (MOTHER, etc.) needs its own tag. Tags in enums start with "0".
- "required Relationship relationship = 4;" works the same way as "string name": it means that "relationship" is of the Relationship type and its tag is 4.

- "message Family {}" works the same way as Person and represents the Family class.
- The __"repeated"__ in "repeated Person person = 1;" is like a list: here the field holds a list of Person values.
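Because the definition above is only available as a screenshot, the following is a minimal sketch of what such a file might look like, written out from Python to stay in this article's language. It is a reconstruction from the description, not the original code: the FATHER value, the nesting of the enum inside Person, and the helper name `write_family_proto` are assumptions, and the field with tag 3 is left out because the text does not describe it.

```python
from pathlib import Path

# Hypothetical reconstruction of family.proto based on the bullets above.
# Assumptions: FATHER is an illustrative enum value, the enum is nested
# inside Person, and the field with tag 3 (not described) is omitted.
FAMILY_PROTO = """\
syntax = "proto2";

// One member of the family.
message Person {
  required string name = 1;
  required int32 age = 2;

  /* Tags inside an enum start at 0. */
  enum Relationship {
    MOTHER = 0;
    FATHER = 1;
  }
  required Relationship relationship = 4;
}

// A family is a list of Person messages.
message Family {
  repeated Person person = 1;
}
"""


def write_family_proto(directory="."):
    """Write the sketch to <directory>/family.proto and return its path."""
    path = Path(directory) / "family.proto"
    path.write_text(FAMILY_PROTO, encoding="utf-8")
    return path
```

Calling `write_family_proto()` creates "family.proto" in the current directory, ready to be passed to protoc.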

Make Message Type available in python

- The file containing the code above is named "family.proto". To make it usable from Python, run the command __protoc --python_out=<output directory> <Message Type file name>__. For "family.proto", this generates a Python module named "family_pb2.py" in the output directory.
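The same command can also be launched from a Python script. The sketch below assumes protoc is installed and on PATH; the function names `build_protoc_command` and `compile_proto` are made up for illustration.

```python
import shutil
import subprocess


def build_protoc_command(proto_file, out_dir="."):
    """Return the protoc command line as a list of arguments."""
    return ["protoc", f"--python_out={out_dir}", proto_file]


def compile_proto(proto_file, out_dir="."):
    """Run protoc if it is installed; return False if it is not on PATH."""
    if shutil.which("protoc") is None:
        return False
    # check=True raises CalledProcessError if protoc reports an error.
    subprocess.run(build_protoc_command(proto_file, out_dir), check=True)
    return True


# The command that would be executed for family.proto:
print(" ".join(build_protoc_command("family.proto")))
# prints: protoc --python_out=. family.proto
```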

Write data in Python

- By importing the file generated from the Message Type in Python, you can use the types defined in it (such as the Family type in family.proto). Use them to actually enter data in Python.

- Code ![Screenshot 2020-10-28 22.40.57.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/9b0bb720-d9c4-5b63-b475-8c496b12dd63.png)

- Result (only part) ![Screenshot 2020-10-28 22.41.42.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/b2c53d35-0695-7324-2d0f-fc31d5fe81b7.png)
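As a rough sketch of this step (not the code in the screenshot), data can be entered by creating a Family message and adding Person entries to its repeated field. The helper `build_family` and the sample values are hypothetical, and the code assumes the Relationship enum is nested inside Person so that its values are exposed as attributes such as `family_pb2.Person.MOTHER`.

```python
def build_family(pb2, members):
    """Build a Family message from (name, age, relationship) tuples.

    `pb2` is the module generated by protoc (assumed to be family_pb2).
    """
    family = pb2.Family()
    for name, age, relationship in members:
        person = family.person.add()  # append a new Person to the repeated field
        person.name = name
        person.age = age
        # Assumes the enum is nested in Person, e.g. pb2.Person.MOTHER.
        person.relationship = getattr(pb2.Person, relationship)
    return family


# Usage, once family_pb2.py has been generated by protoc:
#   import family_pb2
#   family = build_family(family_pb2, [("Alice", 40, "MOTHER")])
#   print(family)                       # human-readable text form
#   data = family.SerializeToString()   # bytes that can be written to a file
```

Passing the generated module as an argument keeps the helper independent of where the generated file lives.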

About hdf5

What is hdf5

- hdf5 is a data format used by Keras, for example when saving a model trained in Keras.
- With hdf5, __a hierarchical structure can be contained in a single file__. In other words, a hierarchy that would otherwise need multiple nested folders (directories) can be built entirely inside one hdf5 file.

Create hdf5 file

- Use the h5py library together with Pandas to create the file.
- Below, an hdf5 file is created using the population of prefecture A as an example.

- Code ![Screenshot 2020-10-28 22.43.22.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/aec2bdea-e1da-2327-6f2b-9684858f1cbb.png)

- Open an hdf5 file: __h5py.File("filename")__ (pass a mode such as "w" as the second argument to create the file for writing)
- Create a group (directory): __file.create_group("group name")__
- Write the data to the file with __flush()__ and close it with __close()__.
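The three steps above can be sketched as follows. This is a simplified illustration rather than the code in the screenshot: it uses NumPy arrays instead of Pandas, and the group name, dataset name, and population figures are invented for the example.

```python
import h5py
import numpy as np


def write_population(path):
    """Create an hdf5 file with one group holding a population dataset."""
    file = h5py.File(path, "w")                  # open the file for writing
    group = file.create_group("prefecture_A")    # a group acts like a directory
    # Hypothetical population figures for two cities in prefecture A.
    group.create_dataset("population", data=np.array([120000, 45000]))
    file.flush()                                 # write buffered data to disk
    file.close()


def read_population(path):
    """Read the dataset back using a directory-like path inside the file."""
    with h5py.File(path, "r") as file:
        return file["prefecture_A/population"][:].tolist()
```

Reading with `file["prefecture_A/population"]` shows how the hierarchy inside the single file is addressed like a path.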

About TFRecord

What is TFRecord

TFRecord is a simple record-oriented binary format that allows you to process large amounts of data that can't fit in memory. (Quoted from tdi, "How to create and read TFRecord, the format recommended by TensorFlow": [https://www.tdi.co.jp/miso/tensorflow-tfrecord-01#:~:text=TFRecord%E3%81%AF%E3%80%81%E3%80%8C%E3%83%A1%E3%83%A2%E3%83%AA%E3%81%AB%E5%8F%8E%E3%81%BE%E3%82%89,%E3%81%AE%E3%83%95%E3%82%A9%E3%83%BC%E3%83%9E%E3%83%83%E3%83%88%E3%80%8D%E3%81%A8%E3%81%84%E3%81%86%E3%81%93%E3%81%A8%E3%81%A7%E3%81%99%E3%80%82])

- TFRecord is a data format used in TensorFlow that, as quoted above, makes it possible to process large amounts of data.

Export an image to a file in TFRecord format

- The flow is "read the image", "define the data to write out", and "write".
- The actual code is as follows (the file path is fictitious).

- Code ![Screenshot 2020-10-28 22.45.18.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/259f8e5c-c321-d243-1ae8-5ba83afb733a.png)

- In the "definition of the data to be exported", many instances such as __"tf.train.Example()", "tf.train.Features()", "tf.train.Feature()", and "tf.train.BytesList()"__ are created hierarchically, and each has its own role.
- "tf.train.BytesList(value=[data])" creates an instance holding the data given in __[]__. The data must be of bytes type, so convert it with __tobytes()__.
- "'key': tf.train.Feature()" creates a __Feature instance associated with a key__ from the BytesList instance.
- "tf.train.Features()" holds __multiple Feature instances as a dictionary__.
- "tf.train.Example()" creates an Example instance from the Features instance. This is the object that gets written to the file.

- In the "write" part, __"tf.python_io.TFRecordWriter('filename')"__ is the TFRecord version of "open('filename', 'w')". In TensorFlow 2.x the same class is available as "tf.io.TFRecordWriter".
- Finally, write the record with __"fp.write(my_Example.SerializePartialToString())"__ and you are done.
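The flow described above can be sketched as follows, assuming TensorFlow 2.x (hence `tf.io.TFRecordWriter` rather than `tf.python_io.TFRecordWriter`). Because the file path in the original is fictitious, this sketch takes a NumPy array in place of an image read from disk; the key "image" and the function names are assumptions, and `SerializeToString()` is used here as the more common serialization call.

```python
import numpy as np
import tensorflow as tf


def make_example(image):
    """Wrap the raw bytes of an image array in a tf.train.Example."""
    data = image.tobytes()  # BytesList requires bytes
    bytes_list = tf.train.BytesList(value=[data])
    feature = tf.train.Feature(bytes_list=bytes_list)
    features = tf.train.Features(feature={"image": feature})
    return tf.train.Example(features=features)


def write_tfrecord(path, image):
    """Serialize one Example and write it to a TFRecord file."""
    with tf.io.TFRecordWriter(path) as fp:
        fp.write(make_example(image).SerializeToString())
```

For example, `write_tfrecord("sample.tfrecord", np.zeros((2, 2, 3), dtype=np.uint8))` writes a single record containing the 12 bytes of a tiny dummy image.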

Variable-length and fixed-length lists

- Lists come in two kinds: __variable length__, whose length can change, and __fixed length__, which can only hold a fixed amount of data.
- Python lists are normally variable length, but the "tf.train.Example()" from the previous section is meant for fixed-length data.
- To generate variable-length data, use __"tf.train.SequenceExample()"__.
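As a small illustration of the last point, a SequenceExample stores a sequence as a FeatureList, which can hold any number of Feature entries. The key "values" and the function name `make_sequence_example` are made up for this sketch.

```python
import tensorflow as tf


def make_sequence_example(values):
    """Wrap a list of integers of any length in a tf.train.SequenceExample."""
    # One Feature per element, so the sequence can be as long as needed.
    steps = [
        tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))
        for v in values
    ]
    feature_lists = tf.train.FeatureLists(
        feature_list={"values": tf.train.FeatureList(feature=steps)}
    )
    return tf.train.SequenceExample(feature_lists=feature_lists)
```

The result can be serialized with `SerializeToString()` and written with a TFRecord writer in the same way as an Example.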

That's all for this time. Thank you for reading to the end.
