Python's pandas and its DataFrame.to_parquet are so convenient that there is a general feeling that "Parquet files are something you handle in Python".
https://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet
It turned out to be easy to create one in Ruby as well, so here is how.
You can use the official Apache gem, red-parquet. (Note: this is not the same gem as red-arrow.) https://github.com/apache/arrow/tree/master/ruby/red-parquet
Install the gem
$ gem install red-parquet
Create a test file (CSV)
$ echo colA,colB > test.csv
$ echo 1,2 >> test.csv
Convert in Ruby (CSV -> Parquet)
$ irb
irb(main):001:0> require "parquet"
=> true
irb(main):002:0> table = Arrow::Table.load("./test.csv")
=> #<Arrow::Table:0x7fbb0d3e6708 ptr=0x7fbb0e0a4010>
colA colB
0 1 2
irb(main):003:0> table.save("./test.parquet")
=> true
Upload test.parquet to S3 and check it with S3 Select
It worked! (It even does type inference!)
Judging from these slides, Ruby seems to be surprisingly capable of working with data files like this. https://www.slideshare.net/kou/datasciencerb