In this article, I would like to compare the processing time of converting Apache Arrow and JSON to columnar format Yosegi. think.
Yosegi has CLI, so use this.
First, clone from GitHub. Run the setup command after compiling this jar.
$ git clone https://github.com/yahoojapan/yosegi-tools
$ mvn clean package
$ ./bin/setup.sh
This makes the command available. There is a sample JSON file, you can check the operation here.
$ ./bin/yosegi.sh create -i ./etc/sample_json.txt -o /tmp/a.json -f json
$ ./bin/yosegi.sh cat -i /tmp/a.json -o "-"
{"summary":{"total_price":550,"total_weight":412},"number":5,"price":110,"name":"apple","class":"fruits"}
{"summary":{"total_price":800,"total_weight":600},"number":10,"price":80,"name":"orange","class":"fruits"}
Use the data obtained by converting the lineitem of TPC-H to JSON. The data size is about 3.5GB.
$ ls -l /tmp/lineitem.json
-rwxrwxrwx 1 hoge hoge 3665538085 Mar 6 14:24 /tmp/lineitem.json
Convert from JSON to Yosegi. The processing time was about 2 minutes and 41 seconds.
$ time HEAP_SIZE=1g ./bin/yosegi.sh create -i /tmp/lineitem.json -o /tmp/lineitem.yosegi.gz -f json
real 2m41.488s
user 1m47.595s
sys 0m44.432s
Converts the output Yosegi file to Apache Arrow format. The processing time was about 16 seconds.
$ time HEAP_SIZE=1g ./bin/yosegi.sh to_arrow -i /tmp/lineitem.yosegi.gz -o /tmp/lineitem.arrow
real 0m16.134s
user 0m9.927s
sys 0m2.392s
The data size will be approximately 1.5GB. Since there is little Key information, it is significantly less than JSON.
$ ls -l /tmp/lineitem.arrow
-rwxrwxrwx 1 hoge hoge 1566760242 Mar 6 16:24 /tmp/lineitem.arrow
Converts the output Apache Arrow file to Yosegi format. The processing time was about 1 minute and 9 seconds.
$ time HEAP_SIZE=1g ./bin/yosegi.sh from_arrow -i /tmp/lineitem.arrow -o /tmp/lineitem_from_arrow.yosegi.gz
real 1m9.046s
user 1m4.562s
sys 0m1.558s
Finally, perform the Yosegi to JSON conversion. The processing time was about 14 minutes and 43 seconds.
$ time HEAP_SIZE=1g ./bin/yosegi.sh cat -i /tmp/lineitem.yosegi.gz -f json -o /tmp/lineitem_from_yosegi.json
real 14m42.576s
user 0m10.492s
sys 8m23.839s
Since the writing destination is a disk, I will set it to / dev / null for comparison.
The JSON conversion took about 46 seconds, and the Apache Arrow conversion took about 10 seconds.
$ time HEAP_SIZE=1g ./bin/yosegi.sh cat -i /tmp/lineitem.yosegi.gz -f json -o "-" > /dev/null
real 0m45.748s
user 0m44.851s
sys 0m0.429s
$ time HEAP_SIZE=1g ./bin/yosegi.sh to_arrow -i /tmp/lineitem.yosegi.gz -o "-" > /dev/null
real 0m10.141s
user 0m9.431s
sys 0m0.347s
processing | JSON time | Apache Arrow time | Apache Arrow time/JSON time |
---|---|---|---|
Write to Yosegi | 161s | 69s | 0.43 |
Read from Yosegi | 46s | 10s | 0.22 |
Apache Arrow is about 2.3 times faster for writing and 4.55 times faster for reading than JSON.
--In the write process, JSON is added to the process of saving it in memory with the column data structure. Apache Arrow has a column data structure, so it can be converted as it is. Therefore, the process for temporarily saving in the memory is omitted. It is presumed that the reason why the difference is small compared to reading is that the processing time for compression is large.
--In the reading process, JSON is loaded into memory with the column data structure, then read in message units and converted to JSON. Apache Arrow has a column data structure, so it can be loaded as is. Therefore, the process of reading and the process of converting each message are omitted.
From this, it can be said that it is efficient to convert each other via Apache Arrow. Assuming that Apache Arrow will become widespread in the future, it may be more efficient to design input / output based on Apache Arrow. This article has described data formats, but we also want to better understand the exchange of data between languages.
The Apache Arrow binary can be easily processed with python. Yosegi itself only supports Java, but it can be easily linked by going through Apache Arrow.
import pyarrow as pa
reader = pa.RecordBatchFileReader( pa.OSFile( "/tmp/lineitem.arrow" ) )
rb = reader.get_record_batch(0)
df = rb.to_pandas()
print( df["l_linestatus"].value_counts() )
The execution result of the above program is as follows.
$ time python a.py
F 25129
O 24871
Name: l_linestatus, dtype: int64
real 0m0.327s
user 0m0.269s
sys 0m0.042s
Yosegi, which is currently developing as OSS, is looking for users and developers! If you have any interest in Yosegi, please feel free to contact us!
Recommended Posts