In this article, I compare the processing time of converting Apache Arrow data and JSON data to the columnar format Yosegi.

Yosegi's write performance is already compared with ORC and Parquet in my previous article, so please refer to that for background.

Preparing Yosegi

Yosegi has a CLI, so I use it here.

First, clone the repository from GitHub, build the jar, and then run the setup script.

$ git clone https://github.com/yahoojapan/yosegi-tools
$ cd yosegi-tools
$ mvn clean package
$ ./bin/setup.sh

This makes the yosegi.sh command available. The repository includes a sample JSON file that you can use to check that it works.

$ ./bin/yosegi.sh create -i ./etc/sample_json.txt -o /tmp/a.json -f json
$ ./bin/yosegi.sh cat -i /tmp/a.json -o "-"
{"summary":{"total_price":550,"total_weight":412},"number":5,"price":110,"name":"apple","class":"fruits"}
{"summary":{"total_price":800,"total_weight":600},"number":10,"price":80,"name":"orange","class":"fruits"}

Data used for comparison

The data is the TPC-H lineitem table converted to JSON. The file size is about 3.7 GB (3.4 GiB).

$ ls -l /tmp/lineitem.json
-rwxrwxrwx 1 hoge hoge 3665538085 Mar  6 14:24 /tmp/lineitem.json
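
The article does not show how the JSON file was produced. One possible way is sketched below, assuming the pipe-delimited lineitem.tbl that the TPC-H dbgen tool emits and the standard lineitem column names. The function name and the choice of which columns become numbers are assumptions for illustration, not the procedure actually used for the measurements.

```python
import io
import json

# Standard TPC-H lineitem column names, in file order.
COLUMNS = [
    "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
    "l_quantity", "l_extendedprice", "l_discount", "l_tax",
    "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
    "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment",
]
# Assumption: keys and line number as integers, measures as floats,
# everything else (flags, dates, text) as strings.
INT_COLUMNS = {"l_orderkey", "l_partkey", "l_suppkey", "l_linenumber"}
FLOAT_COLUMNS = {"l_quantity", "l_extendedprice", "l_discount", "l_tax"}


def tbl_to_json_lines(src, dst):
    """Write one JSON object per line to dst for each pipe-delimited row in src."""
    for line in src:
        row = line.rstrip("\n")
        # dbgen terminates each row with a trailing "|"; drop exactly one.
        if row.endswith("|"):
            row = row[:-1]
        record = {}
        for name, value in zip(COLUMNS, row.split("|")):
            if name in INT_COLUMNS:
                record[name] = int(value)
            elif name in FLOAT_COLUMNS:
                record[name] = float(value)
            else:
                record[name] = value
        dst.write(json.dumps(record) + "\n")


# Illustrative row in dbgen layout, converted in memory.
src = io.StringIO(
    "1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|"
    "1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|\n"
)
dst = io.StringIO()
tbl_to_json_lines(src, dst)
print(dst.getvalue(), end="")
```

For a real file, src and dst would be file objects opened on lineitem.tbl and /tmp/lineitem.json respectively.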

Verification work

JSON to Yosegi conversion

Convert from JSON to Yosegi. The processing time was about 2 minutes and 41 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh create -i /tmp/lineitem.json -o /tmp/lineitem.yosegi.gz -f json

real    2m41.488s
user    1m47.595s
sys     0m44.432s

Conversion from Yosegi to Apache Arrow

Next, convert the resulting Yosegi file to Apache Arrow format. The processing time was about 16 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh to_arrow -i /tmp/lineitem.yosegi.gz -o /tmp/lineitem.arrow

real    0m16.134s
user    0m9.927s
sys     0m2.392s

The resulting file is approximately 1.5GB. Because Arrow does not repeat the key names in every record, it is significantly smaller than the JSON file.

$ ls -l /tmp/lineitem.arrow
-rwxrwxrwx 1 hoge hoge 1566760242 Mar  6 16:24 /tmp/lineitem.arrow
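
To get a feel for how much of a JSON file is key names, the sketch below counts the bytes spent on keys in a JSON-lines encoding. It uses the first record of the sample output shown earlier, not the lineitem data, and the function name is hypothetical. It only approximates the effect, since Arrow also stores numbers in binary rather than as text.

```python
import json


def key_overhead(records):
    """Return (total_bytes, key_bytes) for a compact JSON-lines encoding."""

    def count_keys(obj):
        # Each key costs its name plus two quotes and a colon; recurse into
        # nested objects and arrays.
        n = 0
        if isinstance(obj, dict):
            for key, value in obj.items():
                n += len(key.encode("utf-8")) + 3
                n += count_keys(value)
        elif isinstance(obj, list):
            for value in obj:
                n += count_keys(value)
        return n

    total = 0
    keys = 0
    for rec in records:
        line = json.dumps(rec, separators=(",", ":")) + "\n"
        total += len(line.encode("utf-8"))
        keys += count_keys(rec)
    return total, keys


sample = {"summary": {"total_price": 550, "total_weight": 412},
          "number": 5, "price": 110, "name": "apple", "class": "fruits"}
total, keys = key_overhead([sample] * 1000)
print(f"{keys / total:.0%} of the bytes are key names")
```

For this small record, roughly two thirds of each line is key names. The share for lineitem will differ, but the same repetition is what a columnar format avoids by storing each column name once.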

Conversion from Apache Arrow to Yosegi

Now convert the resulting Apache Arrow file back to Yosegi format. The processing time was about 1 minute and 9 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh from_arrow -i /tmp/lineitem.arrow -o /tmp/lineitem_from_arrow.yosegi.gz

real    1m9.046s
user    1m4.562s
sys     0m1.558s

Yosegi to JSON conversion

Finally, perform the Yosegi to JSON conversion. The processing time was about 14 minutes and 43 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh cat -i /tmp/lineitem.yosegi.gz -f json -o /tmp/lineitem_from_yosegi.json

real    14m42.576s
user    0m10.492s
sys     8m23.839s

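Because that run writes its output to disk, the elapsed time is dominated by I/O rather than by the conversion itself. For comparison, I redirect the output to /dev/null instead.

With the output discarded, the JSON conversion took about 46 seconds, and the Apache Arrow conversion took about 10 seconds.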

$ time HEAP_SIZE=1g ./bin/yosegi.sh cat -i /tmp/lineitem.yosegi.gz -f json -o "-" > /dev/null

real    0m45.748s
user    0m44.851s
sys     0m0.429s

$ time HEAP_SIZE=1g ./bin/yosegi.sh to_arrow -i /tmp/lineitem.yosegi.gz -o "-" > /dev/null

real    0m10.141s
user    0m9.431s
sys     0m0.347s

Summary of comparison

Processing	JSON time	Apache Arrow time	Apache Arrow time / JSON time
Write to Yosegi	161s	69s	0.43
Read from Yosegi	46s	10s	0.22

Compared with JSON, Apache Arrow is about 2.3 times faster for writing and about 4.5 times faster for reading.
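
The ratios can be recomputed from the unrounded real times reported by the time command in the runs above. A small sketch (the helper names are hypothetical):

```python
import re


def to_seconds(duration):
    """Parse a `time`-style duration such as '2m41.488s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# `real` times from the measurements in this article.
real = {
    "write": {"json": "2m41.488s", "arrow": "1m9.046s"},
    "read": {"json": "0m45.748s", "arrow": "0m10.141s"},
}


def speedup(kind):
    """How many times faster the Arrow path is than the JSON path."""
    return to_seconds(real[kind]["json"]) / to_seconds(real[kind]["arrow"])


for kind in real:
    print(f"{kind}: Arrow is {speedup(kind):.2f}x faster than JSON")
```

The read figures are the /dev/null runs, so disk writes do not affect either side of the comparison.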

Summary

- When writing, JSON input needs an extra step: the records must first be stored in memory in a columnar data structure. Apache Arrow is already columnar, so it can be converted as is, and this intermediate step is skipped. The difference is smaller than for reading, presumably because compression accounts for a large part of the write time.

- When reading, the JSON path loads the data into memory in a columnar data structure, then reads it message by message and converts each message to JSON. Apache Arrow is already columnar, so the data can be handed over as is, and both the per-message read and the per-message conversion are skipped.
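
As a rough illustration of the extra work on the JSON path, the sketch below pivots row-oriented records into per-column lists and back again. This is not Yosegi's implementation; the function names are hypothetical, and it assumes flat records, with a missing key stored as None.

```python
def rows_to_columns(rows):
    """Pivot a list of flat dicts (one per record) into {column: [values]}."""
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    return {name: [row.get(name) for row in rows] for name in names}


def columns_to_rows(columns):
    """Inverse pivot: rebuild one dict per record from the column lists."""
    names = list(columns)
    length = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(length)]


rows = [
    {"number": 5, "price": 110, "name": "apple", "class": "fruits"},
    {"number": 10, "price": 80, "name": "orange", "class": "fruits"},
]
columns = rows_to_columns(rows)
print(columns["price"])
print(columns_to_rows(columns) == rows)
```

Data that already arrives as columns, as with Apache Arrow, needs neither pivot, which is the step the two points above describe as being skipped.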

These results suggest that converting between formats via Apache Arrow is efficient. If Apache Arrow becomes widespread, it may be more efficient to design input and output around it. This article covered data formats, but we also want to better understand data exchange between languages.

Bonus

The Apache Arrow binary can be processed easily with Python. Yosegi itself only supports Java, but going through Apache Arrow makes it easy to use from other languages.

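import pyarrow as pa

# Open the Arrow file (random-access IPC format) written by to_arrow.
reader = pa.ipc.open_file(pa.OSFile("/tmp/lineitem.arrow"))

# Read only the first record batch and convert it to a pandas DataFrame.
rb = reader.get_batch(0)
df = rb.to_pandas()
print(df["l_linestatus"].value_counts())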

The execution result of the above program is as follows.

$ time python a.py
F    25129
O    24871
Name: l_linestatus, dtype: int64

real    0m0.327s
user    0m0.269s
sys     0m0.042s

In closing

Yosegi is being developed as open-source software, and we are looking for users and developers! If you are interested in Yosegi, please feel free to contact us!

Comparing the processing time of converting Apache Arrow and JSON to Yosegi