[LINUX] Verify the compression rate and time of PIXZ used in practice


How do you compress your data backups? In this entry, I would like to summarize a brief verification result for the XZ compression format, which is known for its high compression ratio.

The reason for this verification was that I was looking for a way to back up data to S3 on AWS. I keep two backup copies of a data server that holds business project information, and as a disaster countermeasure I decided to also back up to a remote server. However, costs are incurred depending on the amount of data stored and transferred, so I ran this verification to see how far that cost can be reduced.

When I searched online, I found several articles that verified the compression ratio and the time required for compression, but I had the impression that the types of data were biased and that many of them were closer to theoretical verification than to practical use.

Therefore, I would like to verify how much effect can be obtained with data types that are common in business.

Purpose of verification

Validation data

├ design:    1.8GB
├ logs:      8.8GB
└ wordpress: 50MB

First, I prepared the test data in a directory. The design data includes files such as PDF, AI, PSD, XD, and PNG. To make the structure closer to practice, I deliberately included multiple file types. It is meant to represent data from designers and directors.

For the log data, I prepared 8.8GB of daily log files output by a web server. This is also a common pattern in server operation.

Finally, there is data containing source code: the WordPress package in its default state.

Method of verification

This time, I verify with a multi-threaded implementation of XZ (PIXZ) so that the test is closer to practice. XZ achieves a high compression ratio, but in exchange its compression time is very long. Compressing hundreds of GB with a single thread is impractical, so I use multithreading, though of course the result depends on the performance of the machine.
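To illustrate why multithreading helps, here is a minimal Python sketch of block-parallel XZ compression. It is not how PIXZ itself is implemented: it uses Python's standard `lzma` module, and the function name `compress_in_blocks`, the 1 MiB block size, and the default preset are assumptions chosen for illustration. Each block is compressed as an independent .xz stream and the streams are concatenated, which `lzma.decompress` can read back as one piece of data.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_in_blocks(data, block_size=1 << 20, threads=8, preset=6):
    """Compress data as independent .xz streams, one per block, in parallel.

    Hypothetical helper for illustration; block_size and preset are assumptions.
    """
    # Split the input into fixed-size blocks (the last one may be shorter).
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map keeps the input order, so the streams line up with the blocks.
        streams = list(pool.map(lambda b: lzma.compress(b, preset=preset), blocks))
    # Concatenated .xz streams can be decompressed together.
    return b"".join(streams)


def roundtrip_ok(data, **kwargs):
    """Return True if the block-compressed data decompresses to the original."""
    return lzma.decompress(compress_in_blocks(data, **kwargs)) == data
```

Smaller blocks spread the work across more threads, but each block is compressed without knowledge of the others, so the ratio is usually a little worse than compressing the whole input at once.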

Verification machine

# iMac (Retina 5K, 27-inch, Late 2015)
CPU Core i7-6700K 4 cores 8 threads(4.0〜4.2GHz)
RAM 32GB DDR3 1867MHz
SSD WD Black SN750 (Read/Write both 2700MB/s)

Since the CPU has 4 cores and 8 threads, this verification uses 8 threads. Read/write speed affects I/O, so keep in mind that the storage is an SSD. Also keep in mind that the memory is DDR3, which is slower than the current DDR4.

Verification environment

Install PIXZ

$ brew install pixz

Verification command

$ tar -C PARENT_DIR_OF_TARGET -cf - TARGET_DIR_NAME | pixz -9 > OUTPUT_FILE_PATH

$ pixz -d -i OUTPUT_FILE_PATH | tar xf -

If you pass the full path of the directory to the tar command, that whole directory path is recorded in the archive, so the "-C" option is used to switch to the parent directory first and store only the relative path.
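The effect of storing relative paths can be sketched in Python as follows. This is only an illustration under assumptions: it uses the standard `tarfile` module with its built-in single-threaded `w:xz` mode instead of PIXZ, the `arcname` argument plays a role similar to `-C`, and the names `archive_relative`, `list_members`, `demo`, and the placeholder file are hypothetical.

```python
import os
import tarfile
import tempfile


def archive_relative(target_dir, out_path, preset=9):
    """Write target_dir to a .tar.xz archive, storing only its base name."""
    name = os.path.basename(os.path.normpath(target_dir))
    with tarfile.open(out_path, "w:xz", preset=preset) as tar:
        # arcname replaces the full path with the directory's own name.
        tar.add(target_dir, arcname=name)


def list_members(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, "r:xz") as tar:
        return tar.getnames()


def demo():
    """Archive a small throwaway directory and return the stored names."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "wordpress")
        os.makedirs(target)
        with open(os.path.join(target, "index.php"), "w") as f:
            f.write("<?php // placeholder file for the demo\n")
        out = os.path.join(tmp, "wordpress.tar.xz")
        archive_relative(target, out)
        return list_members(out)
```

Calling `demo()` lists the stored names, none of which include the temporary directory's path.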

Conduct verification

Design data

[Image: report_design.png]

Perhaps because it contains various data formats, compression took about 2 minutes, which is slow for 1.8GB. The file shrank to 34% of its original size, so the compression rate is 66%. Decompression was much faster than compression, faster than I expected, which surprised me.

Log data

[Image: report_logs.png]

It is log data consisting only of text, yet it takes about 8 minutes. Compared with the design data, the time looks roughly proportional to the size, with the log data a little faster per GB. The compressed data is about 5% of the original size, a compression rate of 95%! Decompression is also fast for the amount of data!

Source code data

[Image: report_wordpress.png]

Finally, the WordPress source data. Since it is small, compression finishes in about 18 seconds. The size after compression is about 18% of the original, a compression rate of 82%. The rate is lower than for the log data, and I think the cause is that some image data is included.

Verification results

Type of data  Original size  Compression time      Size after compression  Compression rate  Decompression time
Design data   1.8GB          2 minutes 19 seconds  624MB                   66%               6.2 seconds
Log data      8.8GB          8 minutes 11 seconds  480MB                   95%               15.6 seconds
Source code   50MB           18.7 seconds          9.1MB                   82%               1.7 seconds

Since this verification is only meant as a rough guide, sizes are calculated in MB units and times are rounded. Please note that this is not a strict verification result.
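The compression rates in the table can be reproduced from the sizes with a short Python sketch. It assumes 1GB = 1024MB and works from the rounded figures in the table, so it is only as accurate as those figures. The throughput value (input MB per second of compression time) is an extra derived number that does not appear in the measurements themselves, and the names `RESULTS`, `compression_rate`, `throughput`, and `summarize` are hypothetical.

```python
GB = 1024  # assumption: 1GB = 1024MB

# (original size in MB, compressed size in MB, compression time in seconds)
RESULTS = {
    "design": (1.8 * GB, 624, 2 * 60 + 19),
    "logs": (8.8 * GB, 480, 8 * 60 + 11),
    "wordpress": (50, 9.1, 18.7),
}


def compression_rate(original_mb, compressed_mb):
    """Percentage of space saved, rounded to a whole percent."""
    return round((1 - compressed_mb / original_mb) * 100)


def throughput(original_mb, seconds):
    """Input megabytes processed per second of compression time."""
    return round(original_mb / seconds, 1)


def summarize(results=RESULTS):
    """Map each data type to (compression rate in %, throughput in MB/s)."""
    return {
        name: (compression_rate(orig, comp), throughput(orig, secs))
        for name, (orig, comp, secs) in results.items()
    }
```

With these inputs the rates come out as 66%, 95%, and 82%, matching the table, and the log data shows a higher throughput than the design data.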


From the above results, I was able to confirm that the compression rate and time change depending on the type of data. One question remains: this time I created an archive with tar and then compressed it, so doesn't the compression ratio depend on tar at the time of archiving?

If so, it is possible that XZ compression itself does not vary with the type of data and that only tar's compression ratio does. I think this still needs to be verified, so if anyone is familiar with it, please let me know.
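One way to look into this question: as far as I know, `tar -cf` without a compression flag only concatenates the files with headers and padding and does not compress anything, so any size reduction would come from XZ. The following minimal Python sketch checks that on synthetic data. The log line is made up, the helper names `tar_bytes` and `sizes` are hypothetical, and the standard `lzma` module stands in for PIXZ.

```python
import io
import lzma
import tarfile


def tar_bytes(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def sizes():
    """Return (raw size, tar size, tar+xz size) in bytes for synthetic log data."""
    # Synthetic, highly repetitive log data (an assumption for the demo).
    files = {"access.log": b"GET /index.html 200\n" * 50_000}
    raw = sum(len(d) for d in files.values())
    tarred = tar_bytes(files)
    compressed = lzma.compress(tarred, preset=9)
    return raw, len(tarred), len(compressed)
```

In this sketch the tar archive is slightly larger than the raw data because of headers and padding, and the reduction only appears after the XZ step. Real design files or logs will of course compress less dramatically than this repetitive sample.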

Load on the machine

I ran the compression with 8 threads, and CPU usage during compression was 600-800%. Since all cores were close to 100%, limiting the number of threads allocated is essential when considering business use.

Also, when using a VPS, running at high CPU usage for a long time may make you subject to CPU usage restrictions.

On EC2, T instances may use up their CPU credits early, so I think it is better to consider a compression method that puts less load on the CPU.
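As a rough sketch of limiting the load, the helper below builds a pixz command line that uses only part of the available cores and runs at a lower priority. pixz's `-p` option sets the number of CPU cores to use and `nice` lowers scheduling priority. The helper names, the 50% default, and the niceness of 10 are assumptions, and the code only builds the argument list without running pixz.

```python
import os


def limited_threads(fraction=0.5, total=None):
    """Return the number of cores to use: a fraction of the total, at least 1."""
    total = total or os.cpu_count() or 1
    return max(1, int(total * fraction))


def pixz_argv(level=9, threads=None, niceness=10):
    """Build an argument list for a throttled pixz run (not executed here)."""
    threads = threads or limited_threads()
    return [
        "nice", "-n", str(niceness),  # lower scheduling priority
        "pixz", f"-{level}",          # compression level, as in the command above
        "-p", str(threads),           # number of CPU cores pixz may use
    ]
```

On the 4-core, 8-thread machine used here, `limited_threads()` would give 4, which leaves roughly half of the CPU free for other work at the cost of a longer compression time.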

Compression rate and time

The compression rates were 66% for the design data, 95% for the log data, and 82% for the source code, which are very satisfactory results. In particular, design data often saves little space even when compressed, so this looks usable.

The compression ratio is good, but it takes too long... and that is with 8 threads. It may be a little difficult in an environment where available resources are limited, but there seem to be various uses for it on a personal level.

The decompression time is reasonably fast for the data size, so it should be able to handle moderately urgent restores.

It was not a rigorous verification, but I hope it is helpful for those who want a rough guideline.
