[LINUX] I heard a rumor that malloc is slow and memory should be pooled and reused, so I measured it

** Introduction **

Someone told me that "malloc is slow, so it's faster to pool memory and reuse it." But if you really do that, don't you need locking? And if it's true, wouldn't it be better to write a wrapper? This kind of question can't be settled by armchair theorizing, so let's actually measure it!

This time, **we run repeated calloc / free tests on the following four implementations and measure them**. I chose calloc because I personally use it more often; in hindsight, plain malloc would have made for a purer comparison.

  1. Ordinary calloc / free
  2. My own wrapper
  3. An updated version that eliminates the wrapper's bottleneck, borrowing the wisdom of predecessors found via Google
  4. tc_malloc from Google's gperftools (used directly, without forcibly overriding the normal malloc)

The library simply pools a fixed number of blocks of a fixed size. So my prediction before measuring: within the pooled count and size, 2 and 3 will be fast; outside that range, 1 should hold its own and 4 should not. Let's see what actually happens.

** Wrapper specifications **

The caller invokes an init function, which creates a list of memory blocks. Each list node has a used/unused flag. calloc hands out an unused block and moves its node to the tail of the list, so the head is always unused and the tail holds the in-use blocks. (初期構想.png)

Conversely, on free, the in-use entries are searched from the tail for one matching the pointer being freed. A match is not actually freed; it is flagged unused and moved back to the head. (初期構想_free.png)

Since the head and tail positions are remembered, the search on free is a worry, but otherwise memory can be handed out immediately. Doesn't that sound promising!?

** Updated specification that eliminates the bottleneck **

2018/08/05: I added this as a library. It is called mempool.

The bottleneck of the wrapper above is still the behavior on free: when almost all blocks are in use, you end up searching almost the entire list. I was wondering whether I had to rethink the list structure when I found a good approach. On the referenced site, my approach is actually listed as a bad answer, lol. But if you adopt the condition of its final correct answer, "**all memory blocks are the same size, a power of two bytes**", the position can be computed immediately with bit operations. Wonderful!

Based on the above, I reworked the wrapper around its bottleneck.

First, at init time: the list and the memory blocks themselves are laid out contiguously, and the block size is restricted to a power of two. The blocks then sit at multiples of 2^x, so shifting the offset right by x bits on free finds the position. (hhs_init.png)

Now, what happens on free? First, the address range of the initially created block area tells you whether the pointer to be freed is under management at all, so the full search of the first wrapper is unnecessary. If it is in range, a bit shift yields the index, and the corresponding list node is fetched directly and moved to the head. The search is gone entirely. That should be faster, right?

(hhs_free.png)

For computing the shift count x, I referred to the "code for finding the bit position" linked below. It is only used during init, so it plays no part in this measurement, but it is impressive that you don't need to grind through the bit shifts one at a time.

I figured this was about my limit, so on to the measurements.

** Measurement **

The measurement tool and source code are available at: https://github.com/developer-kikikaikai/speedtest

After building, you can reproduce the measurements by running run.rb in sample/mallocspeed.

** Measurement conditions 1. **

The library under test is asked to pool 1024 blocks of 1024 bytes each. Under these conditions, each of the following is measured 50, 100, 500, 1024, and 10000 times, 10 times each, and the total time is output.

  1. 1024-byte calloc / free
  2. 1024-byte × 2 and 1024-byte calloc / free

Time is measured inside the speedtest library. (The timing log is accumulated in memory and dumped at the end, so the measurement overhead should be small.)

** Measurement environment **

Linux Ubuntu 18.04, 8 GByte memory, 4 CPUs (on a VM)

** How to read the results **

| Description | Meaning |
|---|---|
| Action column | which measurement the row represents |
| Total time column | total time in seconds; e.g. `0.15806e-3` means 0.15806 × 10^-3 |
| XXX calloc (calloc/free under initial size) | the 1024-byte calloc/free measurement |
| XXX calloc (calloc/free initial size+over size) | the 1024-byte × 2 plus 1024-byte calloc/free measurement |
| XXX = normal | ordinary calloc/free (written "normal" below) |
| XXX = wrap | the first wrapper |
| XXX = hhs_calloc | the bottleneck-free version (written "hhs_calloc" below) |
| XXX = tc_calloc | tc_calloc/tc_free (written "tc_calloc" below) |

In the ranking lines under each table, "A > B" means A was faster than B.

** Measurement results **

** 50 times x 10 measurements **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.15806e-3 |
| wrap calloc code (calloc/free under initial size) | 0.75032e-4 |
| hhs_calloc (calloc/free under initial size) | 0.135938e-3 |
| tc_calloc (calloc/free under initial size) | 0.186022e-3 |
| normal calloc (calloc/free initial size+over size) | 0.711725e-3 |
| wrap calloc (calloc/free initial size+over size) | 0.487427e-3 |
| hhs_calloc (calloc/free initial size+over size) | 0.443236e-3 |
| tc_calloc (calloc/free initial size+over size) | 0.702283e-3 |

  1. 1024-byte calloc/free: First wrapper >> hhs_calloc > Normal > tc_calloc
  2. 1024-byte × 2 and 1024-byte calloc/free: hhs_calloc > First wrapper >> tc_calloc > Normal

Both wrappers are fast while the count is small. The first wrapper is especially strong when everything stays in range, and once oversized allocations are mixed in, hhs_calloc already leads. Aren't both looking good?

(The labels for 1 and 2 are omitted below.)

** 100 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.283713e-3 |
| wrap calloc code (calloc/free under initial size) | 0.153599e-3 |
| hhs_calloc (calloc/free under initial size) | 0.272328e-3 |
| tc_calloc (calloc/free under initial size) | 0.293742e-3 |
| normal calloc (calloc/free initial size+over size) | 0.1394836e-2 |
| wrap calloc (calloc/free initial size+over size) | 0.1395083e-2 |
| hhs_calloc (calloc/free initial size+over size) | 0.1244906e-2 |
| tc_calloc (calloc/free initial size+over size) | 0.1337618e-2 |

  1. First wrapper > hhs_calloc > Normal > tc_calloc
  2. hhs_calloc > tc_calloc > First wrapper > Normal

This is still only about a tenth of the pool capacity, but the first wrapper's lead has already dropped off sharply. Around here the ranking changes on every run.

** 500 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.1453858e-2 |
| wrap calloc code (calloc/free under initial size) | 0.301273e-2 |
| hhs_calloc (calloc/free under initial size) | 0.1291004e-2 |
| tc_calloc (calloc/free under initial size) | 0.13424e-2 |
| normal calloc (calloc/free initial size+over size) | 0.7863012e-2 |
| wrap calloc (calloc/free initial size+over size) | 0.44953204e-1 |
| hhs_calloc (calloc/free initial size+over size) | 0.6339118e-2 |
| tc_calloc (calloc/free initial size+over size) | 0.6493924e-2 |

  1. hhs_calloc >= tc_calloc > Normal >>> First wrapper
  2. hhs_calloc >= tc_calloc > Normal >> First wrapper

At about half the pool capacity, the first wrapper is by far the slowest. It dropped out faster than I expected (-_-;). The strength of tc_calloc also starts to stand out around here.

** 1024 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.2824874e-2 |
| wrap calloc code (calloc/free under initial size) | 0.20755398e-1 |
| hhs_calloc (calloc/free under initial size) | 0.2560653e-2 |
| tc_calloc (calloc/free under initial size) | 0.2751811e-2 |
| normal calloc (calloc/free initial size+over size) | 0.16774762e-1 |
| wrap calloc (calloc/free initial size+over size) | 0.54582233e-1 |
| hhs_calloc (calloc/free initial size+over size) | 0.13983322e-1 |
| tc_calloc (calloc/free initial size+over size) | 0.14046191e-1 |

  1. hhs_calloc > tc_calloc > Normal >>> First wrapper
  2. hhs_calloc >= tc_calloc > Normal >> First wrapper

We have just reached the pool capacity. hhs_calloc holds up well even in the case where out-of-range allocations are unexpectedly mixed in.

** 10000 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.30325826e-1 |
| wrap calloc code (calloc/free under initial size) | 0.5094687e-1 |
| hhs_calloc (calloc/free under initial size) | 0.30124209e-1 |
| tc_calloc (calloc/free under initial size) | 0.26917778e-1 |
| normal calloc (calloc/free initial size+over size) | 0.176232715e0 |
| wrap calloc (calloc/free initial size+over size) | 0.221244686e0 |
| hhs_calloc (calloc/free initial size+over size) | 0.176962985e0 |
| tc_calloc (calloc/free initial size+over size) | 0.146961605e0 |

  1. tc_calloc > hhs_calloc >= Normal >>> First wrapper
  2. tc_calloc > Normal > hhs_calloc >> First wrapper

tc_calloc finally takes the top. The larger the count, the stronger it seems to get. At this point hhs_calloc is no different from normal calloc, so it barely escapes with the advantage it built up through 1024 iterations.

** Results #2: with malloc overridden by tc_malloc **

These results were interesting, so I'll post them as well. This is effectively a measurement of tc_malloc itself: since every calloc/free is overridden by tc_malloc, every variant's numbers change relative to the first set.

** 50 times x 10 measurements **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.14662e-3 |
| wrap calloc code (calloc/free under initial size) | 0.7501e-4 |
| hhs_calloc (calloc/free under initial size) | 0.14337e-3 |
| tc_calloc (calloc/free under initial size) | 0.22985e-4 |
| normal calloc (calloc/free initial size+over size) | 0.803501e-3 |
| wrap calloc (calloc/free initial size+over size) | 0.28677e-3 |
| hhs_calloc (calloc/free initial size+over size) | 0.271435e-3 |
| tc_calloc (calloc/free initial size+over size) | 0.117837e-3 |

  1. tc_calloc > First wrapper >> hhs_calloc > Normal
  2. tc_calloc > hhs_calloc > First wrapper > Normal

Here tc_calloc is already far ahead. I expected it to be about on par with a fully wrapped calloc/free, so this surprised me.

** 100 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.27454e-3 |
| wrap calloc code (calloc/free under initial size) | 0.147085e-3 |
| hhs_calloc (calloc/free under initial size) | 0.270945e-3 |
| tc_calloc (calloc/free under initial size) | 0.43941e-4 |
| normal calloc (calloc/free initial size+over size) | 0.1559897e-2 |
| wrap calloc (calloc/free initial size+over size) | 0.759443e-3 |
| hhs_calloc (calloc/free initial size+over size) | 0.489991e-3 |
| tc_calloc (calloc/free initial size+over size) | 0.255653e-3 |

  1. tc_calloc > First wrapper > hhs_calloc > Normal
  2. tc_calloc > hhs_calloc > First wrapper > Normal

tc_calloc's speed is undiminished. Otherwise little changes, except that the out-of-range cases benefit from the calloc/free override, so the wrappers come out faster there.

** 500 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.1373838e-2 |
| wrap calloc code (calloc/free under initial size) | 0.1687707e-2 |
| hhs_calloc (calloc/free under initial size) | 0.1296464e-2 |
| tc_calloc (calloc/free under initial size) | 0.262718e-3 |
| normal calloc (calloc/free initial size+over size) | 0.7076653e-2 |
| wrap calloc (calloc/free initial size+over size) | 0.13166146e-1 |
| hhs_calloc (calloc/free initial size+over size) | 0.2556278e-2 |
| tc_calloc (calloc/free initial size+over size) | 0.1392622e-2 |

  1. tc_calloc > hhs_calloc > Normal > First wrapper
  2. tc_calloc > hhs_calloc > Normal > First wrapper

Nothing more needs saying about tc_calloc. Considering the first results, hhs_calloc is holding up well.

** 1024 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.2763511e-2 |
| wrap calloc code (calloc/free under initial size) | 0.6518494e-2 |
| hhs_calloc (calloc/free under initial size) | 0.2707665e-2 |
| tc_calloc (calloc/free under initial size) | 0.561575e-3 |
| normal calloc (calloc/free initial size+over size) | 0.14081652e-1 |
| wrap calloc (calloc/free initial size+over size) | 0.16251563e-1 |
| hhs_calloc (calloc/free initial size+over size) | 0.4127922e-2 |
| tc_calloc (calloc/free initial size+over size) | 0.3438721e-2 |

  1. tc_calloc > hhs_calloc > Normal > First wrapper
  2. tc_calloc > hhs_calloc > Normal > First wrapper

Even in the range the wrappers should excel at, the gap widens. tc_malloc clearly reuses memory well too.

** 10000 times x 10 **

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.27661637e-1 |
| wrap calloc code (calloc/free under initial size) | 0.18210164e-1 |
| hhs_calloc (calloc/free under initial size) | 0.12308628e-1 |
| tc_calloc (calloc/free under initial size) | 0.9218671e-2 |
| normal calloc (calloc/free initial size+over size) | 0.148160442e0 |
| wrap calloc (calloc/free initial size+over size) | 0.84871154e-1 |
| hhs_calloc (calloc/free initial size+over size) | 0.68569389e-1 |
| tc_calloc (calloc/free initial size+over size) | 0.65872532e-1 |

  1. tc_calloc > hhs_calloc > First wrapper > Normal
  2. tc_calloc > hhs_calloc > First wrapper > Normal

So there are libraries that outrun mine even on the home field I prepared for it. Its creator (yours truly) is a little sad.

** Measurement 2 **

The above was a simple comparison, so next we build a scenario based on the use case of the application I have in mind.

** Measurement conditions 2 **

The target application is a multithreaded lighttpd, so we measure under that assumption.

- Measure requests from all connected clients (target: 100,000)
- Rough estimate of the memory needed per request/response: about 4096 bytes × 256 (1 MByte)
- Rough estimate of the connections lighttpd handles (holds) at one time: 10

So

| Element | Conditions |
|---|---|
| Memory pool | 10 clients ⇒ 4096 bytes × 256 × 10 pooled |
| Test case | 10 of the clients connect; after allocating that much memory, it is all released, repeatedly. Since client responses return at different times, the release order is varied appropriately. |
| Assumed total clients | 2000 (10,000 looked like it would take too long, so I stopped there) |

Under these conditions, each measurement is run 5 times and the total time is output.

** Measurement results 2 **

Called 5 times; the input is 512,000 allocations.

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.2694234765e1 |
| wrap calloc code (calloc/free under initial size) | 0.27786253728e2 |
| hhs_calloc (calloc/free under initial size) | 0.673090478e0 |
| tc_calloc (calloc/free under initial size) | 0.617789121e0 |

hhs_calloc >= tc_calloc > Normal > First wrapper

The top two are close enough that their order changes depending on execution timing.

** Results #2: with malloc overridden by tc_malloc **

Call 5 times, input is 512000

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.593853921e0 |
| wrap calloc code (calloc/free under initial size) | 0.4174470792e1 |
| hhs_calloc (calloc/free under initial size) | 0.6199698e0 |
| tc_calloc (calloc/free under initial size) | 0.592267608e0 |

tc_calloc >= Normal >= hhs_calloc > First wrapper

In the earlier runs, the top three traded places on every execution. In this result, however, the calloc underneath hhs_calloc has also been replaced by tc_malloc, which affects its numbers. So let's compare again, taking only the tc_calloc figure from the tc_malloc run:

| Action | Total time |
|---|---|
| normal calloc (calloc/free under initial size) | 0.2694234765e1 |
| wrap calloc code (calloc/free under initial size) | 0.27786253728e2 |
| hhs_calloc (calloc/free under initial size) | 0.673090478e0 |
| tc_calloc (calloc/free under initial size) | 0.592267608e0 |

hhs_calloc > tc_calloc > Normal > First wrapper

Looking at it this way, hhs_calloc may be interesting if tuned well.

2018/06/10: A comment made me curious, so I also tried hhs_calloc without its lock. Here it is side by side with the tc_calloc result.

| Action | Total time |
|---|---|
| hhs_calloc (calloc/free under initial size) | 0.589852316e0 |
| tc_calloc (calloc/free under initial size) | 0.592267608e0 |

Oh! The winner changes on every run, but without the lock, hhs_calloc can win. This could get interesting with proper per-thread handling!

** Conclusion **

- It is true that **malloc/free gets faster if you fix a usage pattern and wrap it appropriately for the situation**. But if you don't think seriously about speed, a wrapper will actually make things slower.
- If speed is what you want, **tc_malloc is fast**. And when adopting it, **replacing malloc outright is faster than calling the tc_* functions alongside the normal allocator**.

By the way, I haven't looked at memory usage at all. If tc_malloc's footprint turns out to be huge, writing your own wrapper may still be worthwhile.

** References **

- Bottleneck elimination: "How to realize a large number of high-speed and memory-efficient sets using a large memory area of alignment"
- Bit shift: "[C#] 'Amazing' code to find the rightmost set bit position"
- Measurement (Ruby aggregation and floating point): "About decimals, Float and BigDecimal in Ruby (for beginners)"
