Note on how to merge multiple Parquet files into one Parquet file
You can merge Parquet files easily with parquet-tools. However, the prebuilt jar that is distributed does not seem to include hadoop-client, so it could not be run locally.
# java -jar parquet-tools-1.9.0.jar cat test.parquet
org/apache/hadoop/fs/Path
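The bare `org/apache/hadoop/fs/Path` message points at a class that could not be loaded. One way to confirm that the downloaded jar really lacks the Hadoop classes is to list its entries. The following is a minimal sketch, assuming `unzip` is installed; `jar_has_class` is a hypothetical helper name, not part of parquet-tools.

```shell
# Hypothetical helper: succeeds if the jar (a zip archive) contains the given entry.
jar_has_class() {
    jar_path=$1
    class_entry=$2
    # unzip -l lists the archive entries; grep -q only reports whether one matches.
    unzip -l "$jar_path" 2>/dev/null | grep -q -F "$class_entry"
}

if jar_has_class parquet-tools-1.9.0.jar 'org/apache/hadoop/fs/Path.class'; then
    echo "Hadoop classes are bundled"
else
    echo "Hadoop classes are missing (or the jar was not found)"
fi
```

If the second message appears for a jar that does exist, the Hadoop classes have to come from somewhere else, which is what building with the local profile later in this note takes care of.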
So you need to build it yourself. I built it by following this article: https://www.lancork.net/2016/10/inspect-parquet-files-using-parquet-tools/
The environment is a CentOS 7.5 machine set up as a server (with GUI).
** 1. Install packages that you may need **
# yum install gcc gcc-c++ java-1.8.0-openjdk-devel boost-devel openssl-devel
** 2. Install maven **
# wget http://ftp.yz.yamagata-u.ac.jp/pub/network/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
# tar zxvf apache-maven-3.3.9-bin.tar.gz
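Extracting the tarball only unpacks Maven; the `mvn` command still has to be reachable from the shell. A minimal sketch, assuming the archive was extracted into the current directory (the `MAVEN_HOME` variable name is just a convention used here, not something the steps above set):

```shell
# Assumption: apache-maven-3.3.9/ was extracted into the current directory.
MAVEN_HOME="$PWD/apache-maven-3.3.9"
export PATH="$MAVEN_HOME/bin:$PATH"

# Confirm that the shell now finds mvn.
if command -v mvn >/dev/null 2>&1; then
    mvn -version
else
    echo "mvn not found; check that $MAVEN_HOME exists"
fi
```

This only affects the current shell session; adding the same two lines to a shell startup file would make the setting permanent.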
** 3. Install Thrift **
# wget -nv http://archive.apache.org/dist/thrift/0.12.0/thrift-0.12.0.tar.gz
# tar zxvf thrift-0.12.0.tar.gz
# cd thrift-0.12.0/
# ./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang --without-php --without-nodejs
# make install
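Before moving on to the build, it can help to confirm that the tools installed in steps 1 to 3 are actually on the PATH. A small sketch; `missing_tools` is a hypothetical helper, and the list of commands (java, mvn, thrift) is an assumption based on what those steps install.

```shell
# Hypothetical helper: print each named command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools java mvn thrift)
if [ -z "$missing" ]; then
    echo "java, mvn and thrift are all available"
else
    echo "not found on PATH:" $missing
fi
```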
** 4. Download the complete source code **
Download the complete source code from https://github.com/apache/parquet-mr
# unzip parquet-mr-master.zip
# cd parquet-mr-master/
** 5. Build **
Run the mvn command under [current directory]/parquet-mr-master/. I did not notice this at first and ran it under [current directory]/parquet-mr-master/parquet-tools, which got me stuck for a while.
# mvn clean package -Plocal
A file called parquet-tools-1.12.0-SNAPSHOT.jar (as of May 8, 2020) is generated under [current directory]/parquet-mr-master/parquet-tools/target. Use this jar to merge.
Any Parquet files would do for the merge test, so I used files from the following site, which I found with a Google search. http://anson.ucdavis.edu/~clarkf/
Check that the contents can be displayed
# java -jar ./parquet-tools-1.12.0-SNAPSHOT.jar cat test.parquet
timeperiod = 01/01/2016 00:00:05
flow1 = 0
occupancy1 = 0.0
speed1 = 0.0
flow2 = 0
occupancy2 = 0.0
speed2 = 0.0
flow3 = 0
occupancy3 = 0.0
speed3 = 0.0
timeperiod = 01/01/2016 00:00:35
flow1 = 0
occupancy1 = 0.0
speed1 = 0.0
flow2 = 0
occupancy2 = 0.0
speed2 = 0.0
flow3 = 0
occupancy3 = 0.0
speed3 = 0.0
···(omitted)···
Merge
# java -jar ./parquet-tools-1.12.0-SNAPSHOT.jar merge ./data/*.parquet ./merge.parquet
Warning: file data/part-r-00000-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet is too small, length: 16979
Warning: file data/part-r-00001-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet is too small, length: 18350
···(omitted)···
Warning: you merged too small files. Although the size of the merged file is bigger, it STILL contains small row groups, thus you don't have the advantage of big row groups, which usually leads to bad query performance!
The tool warns that query performance may suffer because the merged file still contains small row groups, but the merge itself succeeded.
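To see what the warning means in practice, the row groups of the merged file can be counted from the output of the `meta` subcommand of parquet-tools. This is only a sketch, assuming that `meta` prints one line beginning with `row group` per row group (the exact format may differ between versions); `count_row_groups` is a hypothetical helper.

```shell
# Hypothetical helper: count the lines on stdin that start with "row group".
count_row_groups() {
    grep -c '^row group'
}

JAR=./parquet-tools-1.12.0-SNAPSHOT.jar
# Only run when both the jar and the merged file are present.
if [ -f "$JAR" ] && [ -f ./merge.parquet ]; then
    java -jar "$JAR" meta ./merge.parquet | count_row_groups
fi
```

If the count comes out close to the number of input files, the small row groups were most likely carried over as they were, which is the situation the warning describes.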
That's all.