[PYTHON] An introductory guide to program performance tuning

This article introduces performance tuning principles and samples that apply broadly on the program side and are likely to be cost-effective in any language or field.

Conclusion

Keep the following three principles in mind when tuning and developing your program.

- **Principle 1. Reduce the amount of calculation**
  - Improve multiple loops
  - Improve the algorithm
- **Principle 2. Batch high-cost processing**
  - Batch SQL, REST API, and external command executions
  - Batch system calls (ex. read, write)
- **Principle 3. Reuse high-cost items**
  - Reuse connections such as TCP, HTTP, and DB
  - Reuse query results from external servers
  - Reuse allocated memory, objects, threads, etc.

Tuning and optimization of system architecture, hardware configuration, and the like are out of scope for this article. (Principles 2 and 3 can be applied there as well, but we will not go that far.)

Background

Having been involved in the development of various systems, I feel that surprisingly few people are conscious of performance during design and development. Even for parts that, given the purpose and performance requirements of the system, clearly look like they will become major problems later during performance verification, development often proceeds without anyone raising the issue.

Of course, the saying "premature optimization is bad; tune later, when needed" is correct. But doesn't this maxim implicitly expect you to **form hypotheses and predictions about likely performance bottlenecks at design and coding time**? That is why it is so often accompanied by advice such as "extract the parts that will probably need tuning into their own methods or modules, so they are easier to tune later."

Without understanding this, people proceed with coding and unit testing while making no hypotheses or predictions about performance bottlenecks, and then, when system testing starts, fatal performance bottlenecks surface and force large-scale rework. This is not uncommon. **Saying "performance tuning should be done later, when needed" is fine, but without the necessary guidelines (hypotheses and predictions), knowledge, and skills, it is an empty slogan, no different from simply procrastinating on the problem.**

NOTE: If you are developing something like Intel DPDK (a development kit for processing network packets at extremely high speed), you naturally need to mobilize knowledge and skills for more advanced optimization (ex. parallelization, CPU affinity, cache hits, cache lines, CPU architecture, compiler options, hardware offload). But unless you are in such a special field, there are other optimizations that should take top priority.

Features of performance tuning to be introduced

In this article, we will introduce some principles of performance tuning that have the following characteristics, along with samples of typical techniques.

- Can be achieved by modifying the program
- Independent of any specific programming language or library

There are many more specialized and advanced optimizations specific to particular programming languages, architectures, hardware, and so on, but the more general principles should come first.

NOTE: Occasionally I meet people who are very familiar with compiler optimization options, Linux kernel parameter tuning, and so on, but who do not know or apply any of the basic performance tuning covered in this article. Such people try to solve problems with tuning in their own specialty, so depending on where the bottleneck actually lies, their tuning tends to be extremely ineffective. As a bonus, this spreads the misunderstanding among less-experienced engineers that "if even this much effort doesn't help, nothing can be done."

Performance tuning principles

Here are the three principles I consider most important. Their relative importance varies by situation, but as a rule I recommend applying them in order from the top.

- **Principle 1. Reduce the amount of calculation**
- **Principle 2. Batch high-cost processing**
- **Principle 3. Reuse high-cost items**

NOTE: There are others I would like to add, but too many would be hard to put into practice, so I limited the list to three.

Principle 1. Reduce the amount of calculation

One of the most important things in performance tuning is to reduce the amount of calculation (computational complexity) of the program's processing. If you are not familiar with computational complexity, big-O notation, and so on, please refer to the following articles.

- [For beginners] How to calculate the computational complexity of a program
- Overview of how to derive computational complexity orders! ~ Where does the log come from? ~

This is common sense for people who majored in information systems or computer science, but unfortunately few people put it into practice even when they know it. Whether you are writing a program or writing SQL for a DB, make it a habit to always keep computational complexity in mind.

NOTE: In the case of DB SQL, things are trickier because the optimizer may change the internal execution plan depending on the number of records in the table at the time, the available statistics, and so on. For this reason, SQL that initially ran with O(N) complexity may degrade to O(N^2) due to, for example, insufficient accuracy of the DB statistics.
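If you want to see an execution plan for yourself, most DBs let you inspect it. Below is a minimal sketch using Python's built-in sqlite3 module (the table and query are made up for illustration); the same idea applies to EXPLAIN in MySQL or PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")  # hypothetical table

# Without an index, the lookup is a full scan: O(N) per query
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall())

conn.execute("CREATE INDEX idx_t_id ON t(id)")

# With the index, the plan changes to an index search
print(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE id = 1").fetchall())
```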

As a rough standard, **if the computational complexity is O(N) or more, be wary that it may hinder performance; if it is O(N^2) or more, check it with the assumption that it absolutely will**. Typical places to check include the following.

- Loop processing
- Recursive processing
- Data structures and their operations (insert, update, delete, lookup, etc.)
- Other algorithms
- SQL (ex. execution plans for subqueries, joins, scans, etc.)

Example of judgment criteria

- Computational complexity is O(N^2) or more, and the number of processing elements N is in the thousands to tens of thousands or more
- Computational complexity is O(N) or more, N is in the tens to hundreds or more, and processing each element is high-cost
  - See Principle 2 for what counts as high-cost processing

Supplement: Computational complexity coefficient

For example, if the number of operations is given by the formula 3 * n, the computational complexity is O(N): the coefficient 3 is ignored.

However, when considering the performance of a program, it is important to be aware of this coefficient. For example, if processing takes 10 seconds with an operation count of 10 * n, reducing the coefficient from 10 to 1 improves the processing time to 1 second. Depending on the requirements of what you are developing, there are plenty of cases where improving this coefficient is extremely important.

Principles 2 and 3, described below, can also be used to reduce this coefficient.
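As a minimal sketch of reducing only the coefficient (the complexity stays O(N)), the numbers and condition below are arbitrary examples: hoisting a loop-invariant computation out of the loop shrinks the per-element cost.

```python
import math

values = list(range(1_000_000))

# Larger coefficient: the invariant sqrt is recomputed for every element
result = [v for v in values if v > math.sqrt(12345.678) * 2]

# Smaller coefficient: compute the invariant once, outside the loop
limit = math.sqrt(12345.678) * 2
result = [v for v in values if v > limit]
```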

Principle 2. Batch high-cost processing

Given the processing your program performs and its required performance (ex. processing time), there may be high-cost processing that is a major obstacle to meeting the requirements. If such **high-cost processing is executed many times, consider executing it all at once (batching)**.

Whether something counts as high-cost processing is case by case. However, the following kinds of processing are likely to be "high-cost processing that hinders the required performance," so we recommend treating them as candidates.

- External command execution
  - Process creation is expensive
- Requests to servers (REST API, SQL, RPC, etc.)
  - Communication between servers is expensive
  - Processing time per request is, or can become, long
- System calls
  - I/O processing and switching between user space and kernel space are expensive
  - The impact is smaller than the two items above
- Unfamiliar libraries
  - The cost may remain high until you understand what they do internally

As mentioned earlier, even processing that looks like a likely obstacle may not be a problem in some situations. For example, consider three patterns for a process that executes "SQL INSERT (taking 1 ms each time)":

- SQL INSERT executed at most 10 times in a process that must finish within 3 seconds
  - 1 ms * 10 = 10 ms, so it is extremely unlikely to become a performance bottleneck. **No problem**
- SQL INSERT executed 10,000 times in a process that must finish within 3 seconds
  - 1 ms * 10,000 = 10 s, so it is a fatal performance bottleneck. **Problem**
- SQL INSERT executed 10,000 times in a process that must finish within 60 seconds
  - 1 ms * 10,000 = 10 s, so it is very unlikely to become a performance bottleneck. **No problem**

As these three patterns show, whether something becomes an obstacle depends on the performance requirements and the number of processing elements. What matters here is to **make a rough estimate of what may become an obstacle (candidates for high-cost processing) and roughly judge whether it is a real problem**. Anything you cannot conclude is harmless becomes a point to check.

Once you have identified high-cost processing, consider how to batch it. In general, the kinds of high-cost processing listed above offer some form of bulk execution, and executing them in bulk often reduces processing time dramatically. If no such means exists, you may need to ask the callee side to implement one.
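As one concrete form of bulk execution, here is a minimal sketch of batching SQL INSERTs with Python's built-in sqlite3 module (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
rows = [(i, "item%d" % i) for i in range(10000)]

# BAD: one execute() call per row -> 10,000 separate statement executions
for row in rows:
    conn.execute("INSERT INTO items VALUES (?, ?)", row)

# GOOD: a single executemany() call processes all rows in one batch
conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
conn.commit()
```

With a client/server DB the gain is usually even larger, since each individual statement would also pay a network round trip.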

NOTE: Scripting languages such as Python and Ruby are several to dozens of times slower at step-by-step processing than compiled languages such as C/C++/Java (JavaScript is a different story, thanks to the remarkable evolution of the V8 engine). Therefore, depending on the processing content and performance requirements, the step-by-step processing itself may surface as high-cost processing. In such cases, use standard functions and libraries implemented in C or similar for high-speed processing as much as possible, and let them process things in bulk. For example, in Python, leave the generation and processing of huge arrays to libraries such as NumPy whenever possible.

Example of judgment criteria

- Executing dozens to hundreds of external commands
  - More caution is required if each command takes a long time
- Sending hundreds to thousands of HTTP requests, SQL statements, etc.
  - More caution is required when the destination is across the Internet
- Executing thousands to tens of thousands of system calls
  - Be careful when handling thousands to tens of thousands of requests per second, or when processing large amounts of data
  - More attention is required when processing time matters at the μs or ns scale

Principle 3. Reuse high-cost items

For example, program processing is often implemented using threads and various connections (ex. HTTP, SQL). If you use them frequently, creating a new one each time and discarding it can be surprisingly costly. For such **high-cost, frequently used items, consider a mechanism to reuse them if possible**.

Of course, what counts as high cost varies greatly with the program's purpose and performance requirements, just as in Principle 2. For example, if you send an HTTP request only once every 10 minutes, reusing the HTTP connection gains little; if you send tens to hundreds of requests per second, it gains a lot. Likewise, depending on usage and performance requirements, allocating memory with malloc and releasing it with free on every use can be a large cost; in such cases, introducing a mechanism to reuse the allocated memory area itself pays off.
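As a small Python sketch of the memory-reuse idea (assuming a file named file.txt exists), reading with readinto() reuses one preallocated buffer instead of allocating a new bytes object per read:

```python
buf = bytearray(64 * 1024)  # allocate the buffer once
total = 0
with open("file.txt", "rb") as f:
    while True:
        n = f.readinto(buf)  # each read reuses the same buffer
        if n == 0:
            break
        total += n
print("bytes read:", total)
```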

The typical reuse targets are shown below.

- Connections such as HTTP and DB
  - Reuse them via a connection pool
  - For example, with JDBC (Java's DB API), use Tomcat JDBC Pool
- Threads
  - Reuse them via a thread pool
  - For example, in Java, use ExecutorService (a Python sketch follows this list)
- Results of server queries
  - Cache HTTP responses
  - Cache DNS responses for a period of time
  - If your library has a caching mechanism, use it
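As a minimal Python counterpart to the thread pool item above, here is a sketch using the standard concurrent.futures module (the task function is a made-up placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def handle(req):
    # Placeholder for per-request work
    return req * 2

# The pool creates its worker threads once and reuses them for all tasks,
# instead of creating and destroying a thread per request
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(10000)))

print(len(results))
```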

Example of judgment criteria

- Dozens to hundreds of threads are created and destroyed per second
- Large numbers of HTTP requests, SQL statements, etc. are issued without awareness of connection pools or sessions
  - Depending on the language and library in use, connections may be created and destroyed repeatedly behind the scenes

Flow of applying the principle

Apply the optimization principles at the following times:

- When implementing
- When tuning an existing implementation

Check the following while coding. Any place where a principle applies can become a spot that needs optimization later.

- Whether each principle applies
- For the spots that apply, take the following measures:
  - Add a TODO comment and embed logging
  - Run the program and collect performance logs
  - Check the processing time from the performance logs, and judge from the results whether tuning is actually necessary

Incidentally, the finer you subdivide the suspect areas, the more time and effort it takes, so it is fine to group them as "this part and this part look suspicious." You can subdivide the range and pinpoint the problem later, once the area actually turns out to hinder performance.

Take the following measures for the places that apply. Optimizing every such place without thinking may yield only weak, merely local optimizations, so at this stage we simply prepare for later optimization.

- Embed performance logs so that processing time can be measured (a sketch follows below)
- Mark the spot as an important optimization candidate in a TODO comment
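A minimal sketch of this preparation, with a made-up function insert_records standing in for the suspected high-cost processing:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def insert_records(records):
    # Stand-in for a suspected high-cost process (ex. many SQL INSERTs)
    time.sleep(0.1)

records = list(range(10000))

# TODO: optimization candidate (Principle 2: batch these INSERTs?)
start = time.time()
insert_records(records)
elapsed_ms = (time.time() - start) * 1000
logger.info("insert_records: %d records in %.1f ms", len(records), elapsed_ms)
```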

Tuning sample

Here are some samples to help you put each principle into practice. The source code assumes Python 3.6.

NOTE: The measurement environment (ex. OS, Host/VM/WSL) differs between comparisons, so treat the results as reference information only.

Principle 1. Reduce the amount of calculation

Utilize HashMap / Set etc. for multiple loops

Multiple (nested) loops are often used for tasks such as searching one collection for the elements of another. However, since the computational complexity of nested loops grows to O(N^2), O(N^3), and so on, they become a performance bottleneck as soon as the number of elements grows.

In such cases, you can reduce the complexity from O(N^2) to O(N) by, for example, building a HashMap or Set from one of the collections and using it for the search. You may think that the cost of building a new HashMap / Set over all the elements is wasteful, but as the table below shows, that cost is tiny compared to O(N^2).

| Element count | O(N) × 2 | O(N^2) |
| --- | --- | --- |
| 10 | 20 | 100 |
| 100 | 200 | 10,000 |
| 10,000 | 20,000 | 100,000,000 |

First, an example of double loop processing is shown. (Processing such as measurement of execution time is omitted)

bad_performance_no1.py

```python
all_files = list(range(1, 100000))
target_files = list(range(50000, 60000))
matched_files = []

# Computational complexity is O(N^2) -> BAD
for f in all_files:
    if f in target_files:
        matched_files.append(f)

print(len(matched_files))
```

The part to pay attention to here is `if f in target_files:` inside the for statement. Since this part is not written as a loop, it looks like a single operation, but to check "whether the list target_files contains the element f," on average N/2 elements are examined until a match is found. The same applies to collection methods such as `contains()` in Java. For this reason, **even if the program's syntax shows no nested loop, the actual processing may be equivalent to one**.

The following example improves the double loop by creating a Set to search instead of a List. (Processing such as measurement of execution time is omitted)

good_performance_no1.py

```python
all_files = list(range(1, 100000))
# TUNING: build a set with the set() function for O(1) membership tests
target_files = set(range(50000, 60000))
matched_files = []

# Computational complexity improved to O(N) -> GOOD
for f in all_files:
    if f in target_files:
        matched_files.append(f)

print(len(matched_files))
```

The comparison result of the loop processing part is shown below.

| Computational complexity | Execution time |
| --- | --- |
| O(N^2) | 8,228 ms |
| O(N) | 4 ms |

With N = 100,000 processing elements, improving the complexity to O(N) made the code about 2,000 times faster. When the number of records to process reaches tens of thousands or more, even such a simple loop benefits enormously.

Principle 2. Batch high-cost processing

Batch mass executions of external commands

When your program executes external commands, performance degrades in proportion to how many times you execute them. This is because running an external command is a high-cost operation involving process creation.

As an example, take a program whose purpose is to "display the total byte size of all files under /etc." The wc command with the --bytes option outputs the byte size of a specified file.

NOTE: Most general-purpose programming languages, including Python, provide standard functions for getting a file's byte size, so you do not actually need the wc command. For this example, read on under the assumption that "the external wc command is absolutely required."

bad_performance_no_batch.py

```python
import pathlib
from subprocess import check_output

cwp = pathlib.Path("/etc")
files = [f for f in cwp.glob("**/*") if f.is_file()]

total_byte = 0
for f in files:
    # Executes the wc command once per file -> extremely inefficient
    result = check_output(["wc", "--bytes", str(f)])
    size, file_path = result.split()
    total_byte += int(size)

print("file num:", len(files))
print("total byte:", total_byte)
```

The results are shown at the end of this section: because the wc command is executed once per file, the program takes from several seconds to several tens of seconds or more once the number of target files reaches several hundred to several thousand. We therefore need to reduce the number of wc executions as much as possible. Fortunately, wc accepts multiple files in a single invocation, so the number of executions can be batched from once per file down to one.

An example of batch processing is shown below. (Processing such as measurement of execution time is omitted)

NOTE: The parsing of the wc output in this example is quite crude. Please do not imitate it.

good_performance_with_batch.py

```python
import pathlib
from subprocess import check_output

cwp = pathlib.Path("/etc")
files = [f for f in cwp.glob("**/*") if f.is_file()]

# Batch processing: pass all files to a single wc invocation
args = [str(f) for f in files]
# NOTE: in Python, * expands the args list into individual arguments
result = check_output(["wc", "--bytes", *args])
# The last line of wc's output is "<total> total"
total_byte = int(result.decode().splitlines()[-1].split()[0])

print("file num:", len(files))
print("total byte:", total_byte)
```
| Command batching | Execution time |
| --- | --- |
| None | 12,600 ms |
| Yes | 338 ms |

Even though both versions use the same wc command with the --bytes option, batching reduced the processing time to about 1/40.

Some Unix/Linux commands and various libraries (ex. SQL, HTTP, I/O) provide batch processing in some form. When you have a performance problem, use it proactively.

For those who want to pursue batch processing of external commands further, the following articles are helpful. The titles say "shell script," but they are essentially about how to use external commands efficiently.

- To avoid making shell scripts tens of thousands of times slower —— filter without looping
- Continued: To avoid making shell scripts tens of thousands of times slower —— pipes are still fast and easy to understand

Use buffer for I / O processing (= reduce the number of system calls)

In situations with high performance requirements, system calls around I/O processing, such as reading and writing files or sending and receiving over the network, tend to become major bottlenecks. Buffering I/O is therefore important to reduce the number of system calls.

NOTE: The essential question is "what can we do to reduce the number of system calls?"; I/O buffering is simply one effective means to that end.

For example, when reading and writing files, take the appropriate buffering approach for each language.

- Java: use BufferedReader, BufferedWriter, etc.
- C: use the fwrite, fread functions, etc.
- Python: buffering is enabled by default in open()

An example in Python is shown below. Since Python enables the I/O buffer by default, let's check the effect by intentionally disabling it. (Processing such as measurement of execution time is omitted)

```python
# Specifying buffering=0 disables the I/O buffer
f = open("file.txt", "wb", buffering=0)
s = b"single line"
for i in range(500000):
    f.write(s)
f.close()
```

The comparison below shows that enabling the I/O buffer improved write performance by roughly 10 times.

| I/O buffering | Execution time |
| --- | --- |
| Disabled | 2,223 ms |
| Enabled | 245 ms |

Avoid for / while loops (scripting languages)

Scripting languages such as Python, Ruby, and Perl are very slow compared to languages such as C and Java that run as native code, due to how they execute. In particular, it is not uncommon for simple operations and loops to be tens to hundreds of times slower.

When processing performance matters even in a scripting language, delegate the loop-equivalent parts using the following methods:

- Use built-in functions and language features
  - Python list comprehensions
  - Functions such as map, filter, apply (preferably ones implemented in C/C++)
- Use libraries and modules implemented in C, etc.

NOTE: It may be a little confusing, but this is really a technique related to Principle 2: "the scripting language's loop processing is batch-processed by a C-language implementation." Being able to see it this way is important, so keep it in mind day to day.

As an example, let's compare summing a list of numbers with a Python loop versus with the sum function. (Processing such as measurement of execution time is omitted)

```python
nums = range(50000000)

# With a for loop
total = 0
for n in nums:
    total += n

# With the sum function
total = sum(nums)
```

The comparison results are shown below: about 5 times faster.

| Method | Execution time |
| --- | --- |
| for loop | 3,339 ms |
| sum function | 612 ms |

The sum function is fast because it is implemented in C. The join and map functions are also implemented in C, so using them well lets you avoid loops on the Python side and speed things up.
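As another example of delegating bulk work to a C implementation, here is a sketch assuming the third-party NumPy library is installed:

```python
import numpy as np

# Python-level iteration over 10 million elements
total = sum(range(10_000_000))

# The same computation delegated to NumPy's C implementation
total_np = int(np.arange(10_000_000, dtype=np.int64).sum())

print(total == total_np)  # True
```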

Principle 3. Reuse high-cost items

Select an appropriate string concatenation method

Depending on the programming language, concatenating multiple strings into one can be very costly.

In Python and Java, a new string object is created on every concatenation, which is very wasteful. Instead, use StringBuilder in Java and join() in Python (a short sketch follows below). For details, see the following articles:

- [Java] Speed comparison of string concatenation
- How to increase the processing speed of Python, part 2: use join() to concatenate large numbers of strings

Incidentally, since Java 7, single-expression string concatenation such as `s = "hello" + s + "!!"` is automatically optimized to use StringBuilder. Note, however, that this optimization does not apply when you repeatedly concatenate onto a variable across loop iterations.
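A minimal Python sketch of the difference (the list of parts is made up for illustration):

```python
parts = ["line%d" % i for i in range(100000)]

# BAD: += may create a new string object on every iteration
s = ""
for p in parts:
    s += p

# GOOD: join() builds the result in one pass
s = "".join(parts)
```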

Reuse the same connection

When sending HTTP or similar requests through a library, some libraries create and destroy a connection for every request. For example, requests.get() in Python's requests library behaves this way.

If you send tens to hundreds of requests per second, be sure to reuse the same connection via HTTP 1.1 (or later) Persistent Connections.

Let's compare the case where Session is used in the Python requests library and the case where it is not used. (Processing such as measurement of execution time is omitted)

```python
import requests

NUM = 3000
url = 'http://localhost:8080/'

# Without Session
def without_session():
    for i in range(NUM):
        response = requests.get(url)
        if response.status_code != 200:
            raise Exception("Error")

# With Session
def with_session():
    with requests.Session() as ses:
        for i in range(NUM):
            response = ses.get(url)
            if response.status_code != 200:
                raise Exception("Error")

without_session()
with_session()
```

The comparison results are shown below. With a web server on the same host (Local), reuse is about 1.2 times faster; with a web server across the Internet (Internet), about 2 times faster. The higher the connection cost to the server, the larger the effect.

| Configuration | Execution time |
| --- | --- |
| Internet + No Session | 7,000 ms |
| Internet + With Session | 3,769 ms |
| Local + No Session | 6,606 ms |
| Local + With Session | 5,538 ms |

Other

Performance tuning topics not directly tied to the principles.

Avoid computation in DEBUG log output

If a DEBUG log statement involves anything other than a fixed string, the argument expression is evaluated even when the log level suppresses DEBUG output, so the processing cost is still incurred. If such a DEBUG statement sits somewhere executed thousands or tens of thousands of times per second, the performance impact is very large.

When a DEBUG log statement involves computation, always guard it with a log-level check (an if statement) so that no computation occurs at log levels where DEBUG output is disabled.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

l = [1, 2, 3, 4, 5]

# BAD:
# The sum and the string concatenation run regardless of the log level
logger.debug("Sum=" + str(sum(l)))

# GOOD:
if logger.isEnabledFor(logging.DEBUG):
    # Not executed at all unless the level is DEBUG
    logger.debug("Sum=" + str(sum(l)))

# GOOD: a fixed string involves no computation, so this is fine
logger.debug("Entered the function A")
```

NOTE: This does not apply to programming languages with lazy evaluation.
