Python code acceleration approach

Introduction

Hello. Good evening. My name is Tameoka. I have been working as a machine learning engineer at Globis Co., Ltd. since April 2020. At Globis, I am in charge of projects that use machine learning technology and projects that improve the operation of data infrastructure.

I think there are various systems that use machine learning technology, In the current Globis, machine learning is performed in response to user requests. We don't deal with systems that need to return results immediately, We are only dealing with systems that take a certain amount of time to learn and estimate the results asynchronously to the application. Therefore, for now, when writing logic using machine learning technology, we don't really care about speed.

On the other hand, I also do competitive programming in my spare time, I personally like to think about and write fast code.

This time, I will talk about Python code that is often used in machine learning. I have summarized the procedure while actually trying to speed up using various means. I hope you can see it if you like.

Execution environment

hardware

--Mac Book Pro 2019 model --Processor: 2.6 GHz 6 core Intel Core i7 --Memory: 16 GB

software

--Use python: 3.9.1-buster for the Docker image of the Python runtime environment. --Resources are set to be fully available. image.png

Directory structure

The directory structure looks like this. The contents of each file will be described later. I mount it under the fast_python directory to/in the Docker container and use it.

tree


fast_python
├── cython
│   ├── prime.c
│   ├── prime.cpython-37m-x86_64-linux-gnu.so
│   ├── prime.py
│   ├── prime.pyx
│   └── setup.py
├── pybind11
│   ├── prime.cpp
│   ├── prime.cpython-37m-x86_64-linux-gnu.so
│   └── prime.py
└── python
    ├── fast_prime.py
    └── prime.py

Target of speedup

This time, we will consider the following problems and try to speed up the code that answers them.

For the integer X given as standard input
Find the smallest prime number above X and output to standard output.
However, 2 ≤ X ≤ 10^Let it be 14.

Write without being aware of speed

First of all, let's put aside the speedup and just think about how to solve the problem.

Except for the constraint of 2 ≤ X ≤ 10 ^ 14, for integers greater than or equal to X, "whether or not they are prime numbers" are judged one by one. If it is a prime number, I think it will be a simple problem of returning that value. If you express this in the code, it looks like this.

/fast_python/python/prime.py


def is_prime(x: int) -> bool:  #A function that determines whether it is a prime number
    #I haven't thought about it yet


def minimum_prime_number():
    X = int(input().strip())
    answer = 0
    for i in range(X, (10 ** 14 + 32)):
        if is_prime(i):
            answer = i
            break
    print(answer)


if __name__ == '__main__':
    minimum_prime_number()

Here, in the process of minimum_prime_number (), the stop of range of the for statement is set to 10 ^ 14 + 32. The reason is that the smallest prime number greater than or equal to 10 ^ 14 is 10 ^ 14 + 31. If 10 ^ 14, which is the maximum value of X under the constraint condition, is input, 10 ^ 14 + 31 should be output, so This range should be sufficient in this case.

Next, let us consider the logic for determining whether it is a prime number.

Because a prime number is "a natural number greater than 1 and the only positive divisor is 1 and the number itself" I think the logic to determine whether the integer X is a prime number is as follows. This time we have a constraint of 2 ≤ X, so we don't consider cases where X is less than or equal to 1.

For the integer i from 2 to X, calculate the remainder when X is divided by i.
If there is an integer i with a remainder of 0, it returns False.
Returns True if it does not exist.

When I dropped the above logic into a Python program, it became as follows.

/fast_python/python/prime.py


def is_prime(x: int) -> bool:
    if x <= 1:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True


def minimum_prime_number():
    X = int(input().strip())
    answer = 0
    for i in range(X, (10 ** 14 + 32)):
        if is_prime(i):
            answer = i
            break
    print(answer)


if __name__ == '__main__':
    minimum_prime_number()

Let's actually run the program.

bash


# @In a Docker container

root@xxxxxxxxxxxx:/fast_python/python# python prime.py
2  #input
2  #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
3  #input
3  #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
4  #input
5  #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
5  #input
5  #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
6  #input
7  #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
7  #input
7  #output
... (The following is omitted)

Of the prime numbers greater than or equal to the input integer, the smallest one is output, which looks good.

Next, enter 10 ^ 14 as the corner case. We expect 10 ^ 14 + 31 as output.

bash


root@xxxxxxxxxxxx:/fast_python/python# python prime.py
1000000000000  #input
...

Oh, no results are returned.

I waited for about 10 minutes, but no results were returned. It seems that the number entered is too large and the for statement loops a lot, which takes a lot of time.

Acceleration in Python

There is no upper limit on the processing time for this issue, but it's a bit annoying that no results are returned after 10 minutes. I'd like to somehow speed up this code so that the results come back quickly.

As a first step in speeding up your code, it's a good idea to identify where the processing is slow. There are likely to be several specific methods, but the profiler gives you a detailed picture of how long each process takes.

Use profiler cProfile

It seems that there are various Python profiling tools as well, There is a built-in tool called cProfile. Let's use this to see how long each function takes to execute.

If you refer to the Official Document, just execute the following command It seems that you can profile the specified Python code file.

bash


root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile prime.py

This time I want to check the time taken for each process, so sort in the order of tottime Execute with the -s option as shown below.

bash


root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime prime.py

What is tottime? "The total time spent on a given function (excluding the time spent on calling sub-functions)".

When I actually executed the above command, I got the following output. For the standard input at the time of execution, I specified 10 ^ 7, which takes some time to process but returns the output properly.

bash


root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime prime.py
10000000
10000019
         237 function calls in 2.342 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.675    1.675    1.675    1.675 {built-in method builtins.input}
       20    0.664    0.033    0.664    0.033 prime.py:1(is_prime)
        4    0.000    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:1438(find_spec)
       16    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:62(_path_join)
       16    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
... (The following is omitted)

Looking at the output result of the profiler, the process that takes the longest time is as follows.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.675    1.675    1.675    1.675 {built-in method builtins.input}

However, this includes the waiting time for standard input. It means that my standard input typing is slow. In short, I think this can be ignored.

The problem is the second line.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.664    0.033    0.664    0.033 prime.py:1(is_prime)

is_prime () is the slowest function in the actual process, which takes more than 0.6 seconds overall. Apparently, the time taken for other processing is 0.001 seconds or less and can be ignored, and this is the bottleneck.

Make fast python code

I think there are some programs that are difficult to speed up, but in this case we will try to speed up by modifying the logic.

If the processing is taking a long time due to the large number of loops of the for statement, the first thing that comes to mind for speeding up the processing is I think it's about reducing the number of loops. In fact, in this case, you can reduce the number of loops while meeting the requirements for answering the question. In is_prime (), the number of loops can be from 2 to "the largest integer less than or equal to the square root of X". The proof comes out when you google, but this site was easy to understand.

The following is the actual application of this to the code.

/fast_python/python/fast_prime.py


import math


def is_prime(x: int) -> bool:
    if x <= 1:
        return False
    for i in range(2, (math.floor(math.sqrt(x)) + 1)):  #Set the maximum integer less than or equal to the square root as the upper limit
        if x % i == 0:
            return False
    return True


def minimum_prime_number():
    X = int(input().strip())
    answer = 0
    for i in range(X, (10 ** 14 + 32)):
        if is_prime(i):
            answer = i
            break
    print(answer)


if __name__ == '__main__':
    minimum_prime_number()

When I actually executed it, it became as follows. The standard input is unchanged from 10 ^ 7.

root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime fast_prime.py
10000000
10000019
         277 function calls in 1.744 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.738    1.738    1.738    1.738 {built-in method builtins.input}
       20    0.001    0.000    0.001    0.000 fast_prime.py:4(is_prime)  # is_prime()Processing time
        4    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:1438(find_spec)
       16    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:62(_path_join)
       16    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
... (The following is omitted)

Like this, the processing time of is_prime () has been reduced to 0.001 seconds. The code is_prime () before speeding up took 0.664 seconds, so The processing time is actually 1/664 compared to before the speedup.

In this case, even if you give 10 ^ 14, which took more than 10 minutes to process, to the standard input, the result is likely to be returned.

bash


root@xxxxxxxxxxxx:/fast_python/python# python fast_prime.py
100000000000000
100000000000031

When I actually tried it, the result came back in a few seconds. I'm happy. Let's profile and measure the processing time.

bash


root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime fast_prime.py
100000000000000
100000000000031
         318 function calls in 3.529 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.602    2.602    2.602    2.602 {built-in method builtins.input}
       32    0.922    0.029    0.922    0.029 fast_prime.py:4(is_prime)  # is_prime()Processing time
        4    0.000    0.000    0.002    0.001 <frozen importlib._bootstrap_external>:1356(find_spec)
       16    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:62(_path_join)
       16    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
... (The following is omitted)

The processing time for is_prime () was 0.922 seconds. Before speeding up, the result was not returned even after 10 minutes, so it is a big difference in comparison.

Speed ​​up with Cython

In this way, we were able to reduce the number of loops while satisfying the necessary conditions for solving the problem, Let's think about how to make it even faster.

It seems that speeding up by modifying the logic can not be expected any more, By using Cython, you may be able to speed up the process without changing the existing logic.

To use it, first do pip install as shown below.

bash


root@xxxxxxxxxxxx:/fast_python/cython# pip3 install cython

Next, cut out the processing you want to speed up as shown below into a file with the extension .pyx. I will write the code using Cython here. As a general approach for speeding up using Cython, the variables to be used are as follows. There is a way to declare it as a C language variable in the form cdef. I will try this this time.

/fast_python/cython/prime.pyx


import cython
import math


def is_prime(x: int) -> bool:
    cdef:
        long i, stop  #Declare a variable by specifying the C language type
    stop = math.floor(math.sqrt(x) + 1)
    if x <= 1:
        return False
    for i in range(2, stop):
        if x % i == 0:
            return False
    return True

Next, prepare the following setup file.

/fast_python/cython/setup.py


from distutils.core import setup, Extension
from Cython.Build import cythonize

ext = Extension("prime", sources=["prime.pyx"])
setup(name="prime", ext_modules=cythonize([ext]))

Once these are ready, execute the following command.

bash


root@xxxxxxxxxxxx:/fast_python/cython# python setup.py build_ext --inplace

When executed, it will have a C file called prime.c in the current directory. A shared library file called prime.cpython-37m-x86_64-linux-gnu.so will be created. As a result, the is_prime () defined using Cython is changed as shown below. It will be available by import in your Python code.

/fast_python/cython/prime.py


from prime import is_prime  #Is defined using Cython_prime()Import


def minimum_prime_number():
    X = int(input().strip())
    answer = 0
    for i in range(X, (10 ** 14 + 32)):
        if is_prime(i):
            answer = i
            break
    print(answer)


if __name__ == '__main__':
    minimum_prime_number()

Let's run this code and try profiling.

bash


root@xxxxxxxxxxxx:/fast_python/cython# python -m cProfile -s tottime prime.py
100000000000000
100000000000031
         355 function calls (348 primitive calls) in 3.416 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.755    2.755    2.755    2.755 {built-in method builtins.input}
       32    0.655    0.020    0.655    0.020 {prime.is_prime}  # is_prime()Processing time
        2    0.001    0.001    0.001    0.001 {built-in method _imp.create_dynamic}
        5    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
       17    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:56(_path_join)
... (The following is omitted)

The bottleneck is_prime () now takes 0.655 seconds to process. Compared to the previous processing time of 0.922 seconds, the processing time is about 2/3, which means that the speed has been further increased.

Write in a faster programming language

The processing time was shortened by speeding up the code and using Cython, but Further speeding up may be required. (I should say that it may not be so much.) In such a situation, writing in a faster programming language may be one of the approaches.

This time, the code of the is_prime () function written in Python, After rewriting in C ++, a fast programming language, try calling it from Python code and executing it.

The code for rewriting the is_prime () function in C ++ is as follows.

/fast_python/pybind11/prime.cpp


bool is_prime(long x) {
    for (long i = 2; i <= sqrt(x); i++) {
        if (x % i == 0)
          return false;
    }
    return true;
}

I want to call this process from Python code. There seem to be several ways to do this, but this time I used pybind11, which is easy to use.

As with Cython, first do pip install as shown below.

bash


root@xxxxxxxxxxxx:/fast_python/pybind11# pip3 install pybind11

Next, add the code for binding as shown below to the C ++ code created earlier.

/fast_python/pybind11/prime.cpp


#include <pybind11/pybind11.h>  //Add here

bool is_prime(long x) {
    for (long i = 2; i <= sqrt(x); i++) {
        if (x % i == 0)
          return false;
    }
    return true;
}

PYBIND11_MODULE(prime, m) {      //from here
  m.def("is_prime", &is_prime);  //
}                                //Add up to here

Finally, execute the following command to compile.

bash


root@xxxxxxxxxxxx:/fast_python/pybind11# g++ -O2 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` prime.cpp -o prime`python3-config --extension-suffix`

When executed, it will be called prime.cpython-37m-x86_64-linux-gnu.so in the current directory as in Cython. A shared library file is created. You can now use C ++-defined functions by import in your Python code.

/fast_python/pybind11/prime.py


from prime import is_prime  # C++Function defined in is_prime()Import


def minimum_prime_number():
    X = int(input().strip())
    answer = 0
    for i in range(X, (10 ** 14 + 32)):
        if is_prime(i):
            answer = i
            break
    print(answer)


if __name__ == '__main__':
    minimum_prime_number()

When I executed this and actually measured the processing time using cProfile, it became as follows.

bash


root@xxxxxxxxxxxx:/fast_python/pybind11# python -m cProfile -s tottime prime.py
100000000000000
100000000000031
         138 function calls in 2.760 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.670    2.670    2.670    2.670 {built-in method builtins.input}
       32    0.085    0.003    0.085    0.003 {built-in method prime.is_prime}  # is_prime()Processing time
        1    0.002    0.002    0.002    0.002 {built-in method _imp.create_dynamic}
        1    0.000    0.000    2.756    2.756 prime.py:4(minimum_prime_number)
        1    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:882(_find_spec)
... (The following is omitted)

The processing time for is_prime () is now 0.085 seconds. The processing time of is_prime () when using Cython was 0.655 seconds, so From there, we were able to reduce the processing time to about 1/8.

There seems to be other ways to speed up, but this time I'll stop here.

Summary

The correspondence table of the processing time of the bottleneck (is_prime ()) for each means is summarized below. All of these processing times are for 10 ^ 14 as input.

means Bottleneck processing time(Seconds)
No special action 600 seconds or more
for Speeding up to reduce the number of loops 0.922 seconds
for Speeding up to reduce the number of loops&Use Cython 0.655 seconds
for Speeding up to reduce the number of loops& C++ & pybind11use 0.085 seconds

As you can see, there are many ways to speed up Python code. Also, if you use C ++ and pybind11, you can speed up the parts that can be speeded up in the logic. It turned out that a considerable speedup can be expected. In fact, compared to the untouched code, the processing time when using C ++ and pybind11 is less than 1/7000.

This time, the verification is limited to the viewpoint of speed, so if you look only at this It seems that it will be faster if you use pybind11 hard, but At the actual site, I think it is better to decide the policy while considering various factors such as maintainability, man-hours, and resource status.

Until the end Thank you for reading.

Reference book

-High Performance Python --O'Reilly Japan

Reference site

Recommended Posts

Python code acceleration approach
python character code
[Python] Algorithm-aware code
Rewrite Python2 code to Python3 (2to3)
infomap python draw code
Before writing Python code
About Python3 character code
Python Requests status code
OpenCV basic code (python)
Get country code with python
Python code memo for yourself
[Python] Frequently used library code
Debug Python with VS Code
2.x, 3.x character code of python
Stop Omxplayer from Python code
Python frequently used code snippets
Generate QR code in Python
[Python] Sample code for Python grammar
Character code learned in Python
Convert python 3.x code to python 2.x
Document Python code with Doxygen
Python hand play (argparse minimum code)
Python
[Python] Generate QR code in memory
Automatically format Python code in Vim
[Code] Module and Python version output
Write selenium test code in python
Execute Python code from C # GUI
Python parallel / parallel processing sample code summary
Check python code styles using pep8
[Python] Read the Flask source code
Code tests around time in Python
[VS Code] ~ Tips when using python ~
Install python with mac vs code
Installation of Visual studio code and installation of python
Fourier series verification code written in Python
Execute Python code on C ++ (using Boost.Python)
python> link> strftime () and strptime () behavior / code
Write test-driven FizzBuzz code using Python doctest.
Getting Python source code metrics using radon
[Python3] Rewrite the code object of the function
Enumerate duplicate combinations (C ++) → Add Python code
python> coding guide> PEP 0008 --Style Guide for Python Code
A tool for easily entering Python code
Run Python code on A2019 Community Edition
[Python] Get the character code of the file
Get the EDINET code list in Python
[Note] Execute Python code from Excel (xlwings)
Notes on using code formatter in Python
R code compatible sheet for Python users
python setup.py test the code using multiprocess