Comparing the processing time of converting Apache Arrow and JSON to Yosegi

In this article, I compare the processing time of converting Apache Arrow and JSON to the columnar format Yosegi.

Preparing Yosegi

Yosegi has a CLI, so I use it here.

First, clone the repository from GitHub. After building the jar, run the setup script.

$ git clone https://github.com/yahoojapan/yosegi-tools
$ cd yosegi-tools
$ mvn clean package
$ ./bin/setup.sh

The command is now available. The repository includes a sample JSON file that you can use to check that it works.

$ ./bin/yosegi.sh create -i ./etc/sample_json.txt -o /tmp/a.json -f json
$ ./bin/yosegi.sh cat -i /tmp/a.json -o "-"
{"summary":{"total_price":550,"total_weight":412},"number":5,"price":110,"name":"apple","class":"fruits"}
{"summary":{"total_price":800,"total_weight":600},"number":10,"price":80,"name":"orange","class":"fruits"}

Data used for comparison

The input is the lineitem table of TPC-H converted to JSON. The file is about 3.7 GB (3.4 GiB).

$ ls -l /tmp/lineitem.json
-rwxrwxrwx 1 hoge hoge 3665538085 Mar  6 14:24 /tmp/lineitem.json
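
The article does not show how lineitem was converted to JSON. As a rough sketch only, assuming the table was generated by TPC-H dbgen as a pipe-delimited lineitem.tbl and that one JSON object per line is wanted, the conversion could look like the following. The column order follows the TPC-H lineitem definition; the numeric typing, the function names, and the paths are assumptions for illustration.

```python
import json

# Column names of the TPC-H lineitem table, in dbgen output order.
COLUMNS = [
    "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
    "l_quantity", "l_extendedprice", "l_discount", "l_tax",
    "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
    "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment",
]
# Assumption: keys are written as integers, amounts as floats, the rest as strings.
INT_COLUMNS = set(COLUMNS[:4])
FLOAT_COLUMNS = set(COLUMNS[4:8])


def tbl_line_to_json(line):
    """Convert one pipe-delimited dbgen row into a JSON object string."""
    fields = line.rstrip("\n").split("|")
    if fields and fields[-1] == "":
        fields = fields[:-1]  # dbgen ends every row with a trailing '|'
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = {}
    for name, value in zip(COLUMNS, fields):
        if name in INT_COLUMNS:
            record[name] = int(value)
        elif name in FLOAT_COLUMNS:
            record[name] = float(value)
        else:
            record[name] = value
    return json.dumps(record)


def convert(tbl_path, json_path):
    """Write one JSON object per input row (both paths are hypothetical)."""
    with open(tbl_path, encoding="utf-8") as src, \
            open(json_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(tbl_line_to_json(line) + "\n")
```

Calling convert("lineitem.tbl", "/tmp/lineitem.json") would then produce a line-delimited file of the kind used in the measurements.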

Verification work

JSON to Yosegi conversion

Convert from JSON to Yosegi. The processing time was about 2 minutes and 41 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh create -i /tmp/lineitem.json -o /tmp/lineitem.yosegi.gz -f json

real    2m41.488s
user    1m47.595s
sys     0m44.432s

Conversion from Yosegi to Apache Arrow

Convert the Yosegi file produced above to Apache Arrow format. The processing time was about 16 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh to_arrow -i /tmp/lineitem.yosegi.gz -o /tmp/lineitem.arrow

real    0m16.134s
user    0m9.927s
sys     0m2.392s

The resulting file is about 1.6 GB (1.5 GiB). Because Apache Arrow does not repeat the key names in every record, the file is much smaller than the JSON.

$ ls -l /tmp/lineitem.arrow
-rwxrwxrwx 1 hoge hoge 1566760242 Mar  6 16:24 /tmp/lineitem.arrow
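
To get a feel for how much of a JSON record consists of key names, the following sketch measures it on the first sample record shown earlier (not on the lineitem data). The helper name key_bytes is made up for this illustration. In Apache Arrow the field names are stored once in the schema instead of once per record.

```python
import json

# First sample record printed by `yosegi.sh cat` earlier in the article.
line = ('{"summary":{"total_price":550,"total_weight":412},'
        '"number":5,"price":110,"name":"apple","class":"fruits"}')


def key_bytes(value):
    """Bytes spent on key names (including quotes and colon) in compact JSON."""
    if isinstance(value, dict):
        return sum(len(json.dumps(k)) + 1 + key_bytes(v)
                   for k, v in value.items())
    if isinstance(value, list):
        return sum(key_bytes(v) for v in value)
    return 0


total = len(line.encode("utf-8"))
keys = key_bytes(json.loads(line))
print(f"{keys} of {total} bytes ({keys / total:.0%}) are key names")
```

The share for lineitem will differ, since its column names and values have different lengths, but the same per-record repetition applies.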

Conversion from Apache Arrow to Yosegi

Convert the Apache Arrow file produced above to Yosegi format. The processing time was about 1 minute and 9 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh from_arrow -i /tmp/lineitem.arrow -o /tmp/lineitem_from_arrow.yosegi.gz

real    1m9.046s
user    1m4.562s
sys     0m1.558s

Yosegi to JSON conversion

Finally, perform the Yosegi to JSON conversion. The processing time was about 14 minutes and 43 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh cat -i /tmp/lineitem.yosegi.gz -f json -o /tmp/lineitem_from_yosegi.json

real    14m42.576s
user    0m10.492s
sys     8m23.839s

Because the run above wrote its output to disk, I send the output to /dev/null instead for the comparison.

The JSON conversion took about 46 seconds, and the Apache Arrow conversion took about 10 seconds.

$ time HEAP_SIZE=1g ./bin/yosegi.sh cat -i /tmp/lineitem.yosegi.gz -f json -o "-" > /dev/null

real    0m45.748s
user    0m44.851s
sys     0m0.429s

$ time HEAP_SIZE=1g ./bin/yosegi.sh to_arrow -i /tmp/lineitem.yosegi.gz -o "-" > /dev/null

real    0m10.141s
user    0m9.431s
sys     0m0.347s

Summary of comparison

Processing         JSON time   Apache Arrow time   Apache Arrow time / JSON time
Write to Yosegi    161s        69s                 0.43
Read from Yosegi   46s         10s                 0.22

Compared with JSON, Apache Arrow is about 2.3 times faster for writing and about 4.5 times faster for reading.
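
The ratios can be recomputed directly from the `real` values reported by `time` in the runs above. This is just a small arithmetic sketch; the helper name to_seconds is made up for the example.

```python
import re


def to_seconds(real):
    """Parse a `time` value such as '2m41.488s' into seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if m is None:
        raise ValueError(f"unexpected time format: {real!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# `real` values copied from the measurements in this article.
real = {
    ("write", "json"): "2m41.488s",
    ("write", "arrow"): "1m9.046s",
    ("read", "json"): "0m45.748s",
    ("read", "arrow"): "0m10.141s",
}

speedup = {}
for op in ("write", "read"):
    json_s = to_seconds(real[(op, "json")])
    arrow_s = to_seconds(real[(op, "arrow")])
    speedup[op] = json_s / arrow_s
    print(f"{op}: JSON {json_s:.1f}s, Arrow {arrow_s:.1f}s, "
          f"Arrow/JSON {arrow_s / json_s:.2f}, speedup {speedup[op]:.2f}x")
```

Each figure comes from a single run, so the ratios should be read as rough indications rather than precise benchmarks.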

Summary

- When writing, JSON input first has to be stored in memory in a columnar data structure. Apache Arrow is already columnar, so it can be converted as is, and the step of temporarily storing the data in memory is skipped. The difference is smaller than for reading, presumably because compression accounts for a large share of the processing time.

- When reading, producing JSON means loading the data into memory in a columnar data structure, then reading it message by message and converting each message to JSON. Apache Arrow is columnar, so the data can be loaded as is, and the per-message read and conversion steps are skipped.
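
The extra work on the JSON side can be pictured with a small conceptual sketch. This is not Yosegi's implementation; it only illustrates, with plain Python lists and made-up function names, the pivot from row-oriented messages to columns on write and back to messages on read. Apache Arrow data arrives already in the column-wise shape, so neither pivot is needed.

```python
import json

# Row-oriented input: one JSON message per line (values from the sample file).
json_lines = [
    '{"number":5,"price":110,"name":"apple","class":"fruits"}',
    '{"number":10,"price":80,"name":"orange","class":"fruits"}',
]


def rows_to_columns(lines):
    """Parse each message and append its values to per-column lists."""
    columns = {}
    for row_index, line in enumerate(lines):
        message = json.loads(line)
        for key, value in message.items():
            # Pad with None if a key first appears in a later row.
            column = columns.setdefault(key, [None] * row_index)
            column.append(value)
        for column in columns.values():
            # Pad columns whose key was missing from this row.
            if len(column) <= row_index:
                column.append(None)
    return columns


def columns_to_rows(columns):
    """Rebuild one compact JSON message per row from the column lists."""
    row_count = len(next(iter(columns.values()), []))
    for i in range(row_count):
        yield json.dumps({k: v[i] for k, v in columns.items()},
                         separators=(",", ":"))


columns = rows_to_columns(json_lines)
print(columns["price"])                # column-wise view a columnar writer needs
print(list(columns_to_rows(columns)))  # per-message conversion for JSON output
```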

From these results, converting between formats via Apache Arrow is efficient. If Apache Arrow becomes widespread, it may be more efficient to design input/output around Apache Arrow. This article covered data formats, but I also want to better understand data exchange between languages.

Bonus

An Apache Arrow binary can be processed easily with Python. Yosegi itself only supports Java, but it can be used from other languages by going through Apache Arrow.

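import pyarrow as pa

# Open the Arrow file written by `yosegi.sh to_arrow`.
reader = pa.ipc.open_file(pa.OSFile("/tmp/lineitem.arrow"))

# Read only the first record batch and convert it to a pandas DataFrame.
rb = reader.get_batch(0)
df = rb.to_pandas()
print(df["l_linestatus"].value_counts())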

Running the program above gives the following result. Only the first record batch is read, so the counts cover its 50,000 rows.

$ time python a.py
F    25129
O    24871
Name: l_linestatus, dtype: int64

real    0m0.327s
user    0m0.269s
sys     0m0.042s

In closing

Yosegi is currently being developed as OSS, and we are looking for users and developers! If you are interested in Yosegi, please feel free to contact us!
