A recommendation for data analysis using MessagePack

MessagePack is a data serialization format proposed and implemented by Sadayuki Furuhashi of Treasure Data, and it is probably best known as the format fluentd nodes use to communicate with each other, so I suspect many of you are already familiar with it. Personally, I often use it to store data when doing data analysis, and this article introduces that approach.

I have tried to write this article as accurately as possible, but I am still a relative amateur at data analysis, so suggestions and comments are very welcome.

What to do with MessagePack

All we want to do is save the data obtained for analysis to a file in MessagePack format, and read that data back at analysis time.

Normally, the orthodox approach to data analysis is to load the data into a database first and run the analysis from there, but in the following cases I think saving and reading in MessagePack format has real merit.

Cases where MessagePack works well

When handling data with an unclear schema

For example, the data output by some system may be JSON in dictionary form, but the keys contained in each record differ, the format of the values varies, and you cannot tell in advance whether a dictionary or an array will arrive. On top of that, even when keys or values change, it would be fine if the schema or specification were properly defined somewhere, but in some cases there is no specification at all and you have no choice but to infer the structure from the actual data.

In such cases, MessagePack can serialize hierarchical structures such as dictionaries and arrays almost as-is, so it is easy to dump the data into a file first without thinking too hard about its shape. When fetching data for analysis, data I/O often becomes the bottleneck, so having the data saved locally as a file makes later trial and error much easier.
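
As a rough sketch of this "just dump it into a file" workflow (using the Python msgpack package; the record contents and file name here are made up for illustration):

import msgpack

# Hypothetical records: keys, value types, and nesting differ from record to record
records = [
    {"name": "Alice", "age": 27, "hist": [5, 3, 1]},
    {"name": "Bob", "tags": {"role": "admin"}},   # extra nested dict, no "age"
    ["raw", "event", 42],                         # an array instead of a dictionary
]

# No schema definition anywhere: each object is serialized as-is and appended
with open('mixed.msg', 'wb') as fd:
    for rec in records:
        fd.write(msgpack.packb(rec))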

When you want to save data temporarily, or save and reload it repeatedly

Data analysis can mean many things, but at the stage where you cannot yet form a solid hypothesis, you often need exploratory analysis: kneading the data to grasp the whole picture and come up with hypotheses. In such cases you might convert the data into a format that is easier to process, extract and save only the parts you need, and so on. That usually involves trial and error around the data format itself, such as "I just want to add this one attribute" or "maybe it is better to save this as a dictionary rather than a list".

When you change the data format flexibly like this, defining the schema somewhere other than the code that processes the data (for example, creating a table with SQL) means that every change to the code can introduce inconsistencies, which quickly becomes very annoying. With MessagePack, on the other hand, you still have to keep the writing side and the reading side in agreement about the data format, but the work required is minimal.
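
As an illustration, here is a minimal sketch of such a change, with a hypothetical new "source" attribute and file name (it assumes a reasonably recent Python msgpack package, whose Unpacker accepts raw=False so that string keys come back as str):

import msgpack

# Writing: the new "source" attribute is simply added to the dictionary
with open('extracted.msg', 'ab') as fd:
    fd.write(msgpack.packb({"name": "Carol", "age": 41, "source": "2020 export"}))

# Reading: .get() with a default keeps records written before the change usable
with open('extracted.msg', 'rb') as fd:
    for rec in msgpack.Unpacker(fd, raw=False):
        print(rec.get("name"), rec.get("source", "unknown"))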

(That said, if you don't leave at least some comments, you may come back to your own code a few months later and have no idea what it means...)

When you need to save and load data quickly (full scans only)

A DB as middleware does not simply store data; it also provides features such as keys and reliability guarantees, so it carries more overhead than simply writing to disk. That can be addressed with techniques such as load balancing, but if all you want is to "save a bit of data", converting it to MessagePack format and writing it directly to a file takes little effort and gives you close to raw file read/write performance.

However, reliability is only what a plain file gives you, and the assumption is that you read the entire file back when loading.
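
To make the "read everything back" part concrete, a full-scan filter and aggregation might look like the following sketch (hypothetical file name and fields, Python msgpack package assumed):

import msgpack

# Full scan: every record is deserialized even though only some of them match
count = 0
total_age = 0
with open('records.msg', 'rb') as fd:
    for rec in msgpack.Unpacker(fd, raw=False):
        if rec.get("age", 0) >= 30:   # no index, so the filter runs record by record
            count += 1
            total_age += rec["age"]

if count > 0:
    print("matched:", count, "mean age:", total_age / count)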

Cases where MessagePack is not a good fit

On the other hand, there are of course cases where saving data in MessagePack format for analysis is not a good fit.

  1. When the data has a clear schema, or is already organized in a DB: just use the features the DB provides.
  2. For analyses such as search and aggregation that benefit greatly from keys and indexes: reading the data here is basically a full scan, so processing time grows linearly with the amount of data. As a rough guide, about one million records is the upper limit for data saved with MessagePack; beyond that, seriously consider putting it in a DB.
  3. When multiple people handle the same data: this approach assumes a single file being read and written, so locking is not considered.
  4. When you need to ensure the confidentiality, integrity, and availability of the data: with everything exported to a single file, you cannot control access on a per-record basis, and it is easy to delete the file by mistake, so this needs careful consideration.

Alternatives

The following technologies can be considered as alternatives; choose whichever fits your situation.

CSV

If your data has a nearly fixed set of columns and no hierarchy, plain CSV (or TSV) is the better choice. When records contain variable-length elements or nested structures, however, MessagePack is easier to work with.
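
To illustrate the difference, here is a minimal sketch using the standard csv module and the Python msgpack package (file names are made up): the flat columns map to CSV naturally, but the nested hist list has to be squeezed into a string, whereas MessagePack stores it as-is.

import csv
import msgpack

rec = {"name": "Alice", "age": 27, "hist": [5, 3, 1]}

# CSV: fine for flat columns, but the nested list must be flattened into a string
with open('flat.csv', 'w', newline='') as fd:
    writer = csv.DictWriter(fd, fieldnames=["name", "age", "hist"])
    writer.writeheader()
    writer.writerow({**rec, "hist": " ".join(map(str, rec["hist"]))})

# MessagePack: the nested structure is written directly
with open('nested.msg', 'wb') as fd:
    fd.write(msgpack.packb(rec))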

MongoDB

As a document-oriented DB, MongoDB lets you insert without defining a schema, so in terms of "just save the data for now" it offers the same convenience. However, its insert performance appears to top out at around 3,500 inserts/sec, whereas writing with MessagePack is just a direct write to disk, so saving with MessagePack, which keeps close to raw disk I/O performance, is overwhelmingly faster. If you want to add keys and indexes later, though, MongoDB is the better fit.

JSON, BSON

These are at a disadvantage compared to MessagePack in terms of processing speed and data size (reference). In addition, when you want to put multiple JSON objects into a single data segment (for example, one file), you either need a module such as ijson that parses incrementally, or you have to parse the whole thing at once, which becomes painful on a modest machine when the data is large. Writing ijson-style code, on the other hand, makes the code more complicated, so personally I find it easier to just append records to one data segment with MessagePack and read them back in order.
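
For a concrete picture, here is a minimal sketch (hypothetical file name and record contents, Python msgpack package assumed): records are appended back to back in one file and streamed out one at a time, so memory use stays small however large the file grows; doing the same with JSON would require ijson or a line-delimited format.

import msgpack

# Append many records back to back in a single file
with open('big.msg', 'wb') as fd:
    for i in range(100000):
        fd.write(msgpack.packb({"id": i, "value": i * 2}))

# Stream them back one by one without ever loading the whole file
with open('big.msg', 'rb') as fd:
    for rec in msgpack.Unpacker(fd, raw=False):
        if rec["id"] % 50000 == 0:
            print(rec)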

Protocol Buffers

Protocol Buffers is another well-known serialization technology, but because you have to define the schema on the processing-code side, handling data with an unclear schema takes extra work. As a serialization format for interfaces between pieces of software it is convenient precisely because it enforces a schema, but it becomes hard to use when you do not know in advance what shape the data will take.

Sample code

The official page has plenty of documentation, so there is not much to add here; below is sample code focused solely on writing to and reading from a file. The code is also available on GitHub.

In every case, the following two records are written to and read from the data.msg file.

{
  "name": "Alice", 
  "age": 27,
  "hist": [5, 3, 1]
}
{
  "name": "Bob", 
  "age": 33,
  "hist": [4, 5]
}

Python

The msgpack-python package is required.

Installation

$ pip install msgpack-python

Write sample code

# coding: UTF-8

import msgpack

obj1 = {
    "name": "Alice",
    "age": 27,
    "hist": [5, 3, 1]
}
obj2 = {
    "name": "Bob",
    "age": 33,
    "hist": [4, 5]
}

# Open in binary mode and append the two serialized records back to back
with open('data.msg', 'wb') as fd:
    fd.write(msgpack.packb(obj1))
    fd.write(msgpack.packb(obj2))

Read sample code

# coding: UTF-8

import msgpack

# Unpacker streams the concatenated records one by one
with open('data.msg', 'rb') as fd:
    for msg in msgpack.Unpacker(fd):
        print(msg)

Ruby

The msgpack gem is required.

$ gem install msgpack

Write sample code

# -*- coding: utf-8 -*-

require "msgpack"

obj1 = {
  "name" => "Alice",
  "age" => 27,
  "hist" => [5, 3, 1]
}
obj2 = {
  "name" => "Bob",
  "age" => 33,
  "hist" => [4, 5]
}

File.open("data.msg", "w") do |file|
  file.write(obj1.to_msgpack)
  file.write(obj2.to_msgpack)
end

Read sample code

# -*- coding: utf-8 -*-

require "msgpack"

File.open("data.msg") do |file|
  MessagePack::Unpacker.new(file).each do |obj|
    puts obj
  end
end

Node

Several major MessagePack libraries exist for Node; here the code uses msgpack-lite.

$ npm install msgpack-lite

Write sample code

const fs = require('fs');
const msgpack = require('msgpack-lite');

const obj1 = {
  name: "Alice",
  age: 27,
  hist: [5, 3, 1]
};
const obj2 = {
  name: "Bob",
  age: 33,
  hist: [4, 5]
};

// Open synchronously, write the two encoded buffers, then close the descriptor
const fd = fs.openSync('data.msg', 'w');
fs.writeSync(fd, msgpack.encode(obj1));
fs.writeSync(fd, msgpack.encode(obj2));
fs.closeSync(fd);

Read sample code

const fs = require('fs');
const msgpack = require('msgpack-lite');

const rs = fs.createReadStream('data.msg');
const ds = msgpack.createDecodeStream();

rs.pipe(ds).on('data', (msg) => {
  console.log(msg);
});

C++

The msgpack-c library is required. On macOS, you can install it with Homebrew.

$ brew install msgpack

Write sample code

#include <msgpack.hpp>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  int fd = open("data.msg", O_WRONLY | O_CREAT, 0600);

  msgpack::sbuffer buf1, buf2;
  msgpack::packer<msgpack::sbuffer> pk1(&buf1), pk2(&buf2);

  pk1.pack_map(3);
  pk1.pack("name"); pk1.pack("Alice");
  pk1.pack("age");  pk1.pack(27);
  pk1.pack("hist");
  pk1.pack_array(3);
  pk1.pack(5); pk1.pack(3); pk1.pack(1);

  write(fd, buf1.data(), buf1.size());


  pk2.pack_map(3);
  pk2.pack("name"); pk2.pack("Bob");
  pk2.pack("age");  pk2.pack(33);
  pk2.pack("hist");
  pk2.pack_array(2);
  pk2.pack(4); pk2.pack(5);

  write(fd, buf2.data(), buf2.size());

  close(fd);
  return 0;
}

Read sample code

#include <msgpack.hpp>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
  static const size_t BUFSIZE = 4; //Dare to make the buffer size smaller
  int rc;
  char buf[BUFSIZE];

  int fd = open("data.msg", O_RDONLY);

  msgpack::unpacker unpkr;
  while (0 < (rc = read(fd, buf, sizeof(buf)))) {
    unpkr.reserve_buffer(rc);
    memcpy(unpkr.buffer(), buf, rc);
    unpkr.buffer_consumed(rc);

    msgpack::object_handle result;
    while (unpkr.next(result)) {
      const msgpack::object &obj = result.get();

      if (obj.type == msgpack::type::MAP) {
        printf("{\n");
        msgpack::object_kv* p(obj.via.map.ptr);

        for(msgpack::object_kv* const pend(obj.via.map.ptr + obj.via.map.size);
            p < pend; ++p) {

          std::string key;
          p->key.convert(key);

          if (key == "name") {
            std::string value;
            p->val.convert(value);
            printf("  %s: %s,\n", key.c_str(), value.c_str());
          }

          if (key == "age") {
            int value;
            p->val.convert(value);
            printf("  %s: %d,\n", key.c_str(), value);
          }

          if (key == "hist") {
            msgpack::object arr = p->val;
            printf ("  %s, [", key.c_str());
            for (int i = 0; i < arr.via.array.size; i++) {
              int value;
              arr.via.array.ptr[i].convert(value);

              printf("%d, ", value);
            }
            printf ("],\n");
          }
        }

        printf("}\n");
      }

      result.zone().reset();
    }
  }

  return 0;
}

Incidentally, if you feed a msgpack::object into an ostream (std::cout, etc.), it is formatted and printed automatically; retrieving the values programmatically, however, is as fiddly as shown above, which is why the sample spells out the procedure.
