I tried to verify and analyze the acceleration of Python by Cython

Introduction

Languages "Python" that are classified as dynamically typed include "Cython" that converts Python code to C / C ++ and compiles it, and "Numba" that accelerates Python using a JIT compiler. I wanted to improve efficiency by incorporating these into the Python code used in research and internships, and in order to deepen my understanding, I tried to compare and verify the processing speed for actual simple processing on jupyter. ..

This article is not an introductory article on Cython. What is Cython? If you want to know the basic knowledge and usage, we recommend the following articles.

-Introduction to cython -Introduction to Cython without going deep

Please feel free to comment if you have any opinions or mistakes. The experimental code is here

Experiment 1 Comparison with is_prime function

I tried to increase the speed by using the simple ʻis_prime` function (primality test function) shown below as a baseline. About the code The same part as the baseline is omitted. Regarding Cython, we verified the difference in speed depending on the type location.

baseline


def find_factor(n):
    answer = n
    for i in range (2,n):
        if n % i == 0:
            answer = i
            break
    return answer

def is_prime(n):
    if find_factor (n) != n:
        return False
    else:
        return True
Speed up with Numba
from numba import jit
@jit
def find_factor0 (n):
    #Abbreviation
def is_prime0 (n):
    #Abbreviation
Acceleration with Cython 1 (Cython1)

Just compile with Cython without changing the code

% load_ext Cython
%% cython
def find_factor1(n):
    #Abbreviation
def is_prime1(n):
    #Abbreviation
Acceleration with Cython 2 (Cython2)

Type all variables

% load_ext Cython
%% cython
def find_factor2(int n):
    cdef int answer = n
    cdef int i
    #Abbreviation
def is_prime2(n):
    #Abbreviation
Acceleration with Cython 2-1 (Cython2-1)

Type definition only i of for statement

% load_ext Cython
%% cython
def find_factor2_1(n):
    answer = n
    cdef int i
    #Abbreviation
def is_prime2_1(n):
    #Abbreviation
Acceleration with Cython 2-2 (Cython2-2)

Type only answer

% load_ext Cython
%% cython
def find_factor2_2(n):
    cdef int answer = n
    #Abbreviation
def is_prime2_2(n):
    #Abbreviation
Speeding up with Cython 2-3 (Cython2-3)

Type definition only argument n

% load_ext Cython
%% cython
def find_factor2_3(int n):
    #Abbreviation
def is_prime2_3(int n):
    #Abbreviation
Acceleration with Cython 2-4 (Cython2-4)

Type definition other than argument n

% load_ext Cython
%% cython
def find_factor2_4(n):
    cdef int answer = n
    cdef int i
    #Abbreviation
def is_prime2_4(n):
    #Abbreviation
Acceleration with Cython 3 (Cython3)

Define the findfactor function with cdef

% load_ext Cython
%% cython
cdef int find_factor3(int n):
    cdef int answer = n
    cdef int i
    #Abbreviation
def is_prime3(n):
    #Abbreviation

In all experiments, the time required to determine the prime number 131071 was measured by % timeit. The average time and standard deviation of the processes executed 100 times each are as follows.

Comparison of speed of is_prime function
Baseline Numba Cython1
No code change
Cython2
The function itself
Cython3
All variables in the function
time(ms) 8.69±0.2 0.49±0.0 5.34±0.1 0.43±0.0 0.43±0.0
Comparison of speed by type definition in Cython
Cython2-1
for statement i only
Cython2-2
answer only
Cython2-3
Argument n only
Cython2-4
Other than argument n
time(ms) 4.81±0.1 5.27±0.4 6.62±0.1 4.85±0.1

From the result of Cython2, it is the fastest for Cython when all the type information is defined. Now let's consider the change in velocity due to the difference in the type definition. As you can see from the difference between Cython2-1 and Cython2-2, the effect of type-defining variables is largely due to the former (type-defining iterator variable ʻi). This is because the type check is entered for each operation in the increment calculation performed in the for loop (equivalent to the process of calling ʻint .__ add__ every time), so if it is executed a lot, the overhead will be high, but the type definition It is thought that this overhead is eliminated by calculating ʻi + 1` directly on the C language.

Also, paying attention to Cython2-3, the argument-only type definition is slower than Cython1. I thought it should be faster by removing the dynamic type checking in the arguments when calling the find factor function, but it's actually slower. Considering that there is a compilation overhead due to the argument type definition rather than those effects, we defined the following functions hoge1, hoge2, which do nothing but differ only in the presence or absence of the argument type definition, and compared the speeds. However, there was a slight difference.

%% cython
def hoge1 (n,a,b,c):
    return
def hoge2 (int n, int a, int b, int c):
    return

However, even with the above function, the difference was about 10 ns (10 million loops are rotated and the difference should not be an error from the standard deviation value), and in addition, there was almost no difference when there was one argument, so this overhead It is not considered that the difference is due to. You can also see from Cython2-4 that it is faster to type the arguments if all the variables in the find factor are typed. From the above, the cause of the delay in Cython2-3 is the dynamic when only one of the n% i == 0 operations is type-defined than when neither n nor i is type-defined. I think the conclusion is that the overhead of the check is larger, but I didn't understand the reason.

The result of typing the find factor itself from Cython3 was not much different from Cython2. The arithmetic processing itself in find factor is the same as Cython2, so I'm convinced. However, in Cython3, the overhead when calling find factor should be removed, so calling find factor once makes almost no difference, but I think that Cython3 will be faster in the process of calling a lot. I created the following function and verified it.

%% cython

cdef int find_factor3_1(int n):
    #Abbreviation
def find_factor3_2(int n):
    #Abbreviation

def is_prime_inf1(n,m):
    for i in range (m):
        find_factor3_1 (n)

def is_prime_inf2(n,m):
    for i in range (m):
        find_factor3_2 (n)
Comparison when the find factor is called 10,000 times
find factor3-1(cdef) find factor3-2(def)
time 158 μs 3.1s

As expected, calling a lot of find factors made a significant difference. Therefore, it was found that the overhead of calling by defining the find factor itself is removed.

Experiment 2 Comparison by matrix product

I tried to increase the speed using the matrix product as a baseline.

Implementation by Python
def dot(a, b):
    n = len(a)
    l = len(a[0])
    m = len(b[0])
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for k in range(l):
                c[i][j] += a[i][k] * b[k][j]
    return c
Implementation by Numba
from numba import jit
@jit
def nm_dot(a, b):
    #Abbreviation
Implementation by Cython 1 (Cython1)

For the time being, type definition is done based on the knowledge so far and implemented with cython. Change to [i] [j][i, j] to speed up, declare type definitions together, change to range xrange, turn off the check function, etc. We are devising.

%%cython
import numpy as np
cimport numpy as np

cython: boundscheck=False
cython: wraparound=False
    
cpdef np.ndarray  c_dot(np.ndarray a, np.ndarray b):
    cdef np.ndarray c
    cdef int n,l,m,i,j,k
    n = len(a)
    l = len(a[0])
    m = len(b[0])
    c = np.zeros((n, m))
    for i in xrange(n):
        for j in xrange(m):
            for k in xrange(l):
                c[i,j] += a[i,k] * b[k,j]
    return c
Implementation by Cython 2 (Cython2)

I made the type definition for the numpy array more strictly by referring to the official document.

cpdef np.ndarray[dtype=float,ndim = 2]  c_dot2(np.ndarray[dtype=float,ndim = 2] a, np.ndarray[dtype=float,ndim = 2] b):
    cdef np.ndarray[dtype=double,ndim = 2] c
    #Abbreviation

In this experiment, we generated a 100 * 100 random Numpy matrix and measured it with % timeit as in Experiment 1. The average and standard deviation of 10 processes for the time-consuming implementation of Python and Cython 1 and 100 processes for other cases are as follows.

Comparison of matrix product speeds
Python Python(Numpy) Numba Cython1 Cython2
time(ms) 1220±54 0.033 1.1 729±16 2.5

It's a bit faster in Cython1 than the naive Python implementation, but much faster in Cython2. In the implementation in Cython1, when the type of each numpy array is defined as np.ndarray, the actual dtype and dimension are not yet determined and it is subject to dynamic check after all, which is equivalent to the processing when Cython is not used. It is thought that this is because it is stored. As you can see from the results, Numpy was overwhelmingly fast when it came to calculating matrix products. In the first place, I used Numpy a lot, but I didn't know the principle why it was so fast, so I did a little research (I should have done it first). , It was said that it was implemented in C language internally. Here too, typing came out and I realized the importance of typing. At least for calculations implemented as modules in numpy such as matrix products, it is better to use numpy as it is, Cython when it is not implemented as a numpy module and can not be realized without actually accessing the numpy array many times. It seems good to use.

in conclusion

Through this experiment, the practical introduction of Cython does not type in the dark clouds, but efficiently improves performance while considering the advantages of Python such as code flexibility and readability, and the use of excellent libraries such as Numpy. I felt that it was good to determine the part to be used.

Reference material

-Introduction to cython -Introduction to Cython without going deep

Recommended Posts

I tried to verify and analyze the acceleration of Python by Cython
I tried to verify the yin and yang classification of Hololive members by machine learning
I tried to verify the result of A / B test by chi-square test
I tried to analyze the New Year's card by myself using python
I tried to verify the speaker identification by the Speaker Recognition API of Azure Cognitive Services with Python. # 1
I tried to verify the speaker identification by the Speaker Recognition API of Azure Cognitive Services with Python. # 2
I tried to summarize the string operations of Python
I tried to get and analyze the statistical data of the new corona with Python: Data of Johns Hopkins University
I tried to find the entropy of the image with python
[Python] I tried to visualize the follow relationship of Twitter
I want to know the features of Python and pip
I tried to enumerate the differences between java and python
I tried to automate the article update of Livedoor blog with Python and selenium.
I tried to compare the processing speed with dplyr of R and pandas of Python
[Python] I tried to analyze the characteristics of thumbnails that are easy to play on YouTube by deep learning
[Introduction to Python] I compared the naming conventions of C # and Python.
I tried to improve the efficiency of daily work with Python
I tried to get the number of days of the month holidays (Saturdays, Sundays, and holidays) with python
I tried web scraping to analyze the lyrics.
I tried moving the image to the specified folder by right-clicking and left-clicking
I tried to touch the API of ebay
[Python] I tried to analyze the pitcher who achieved no hit no run
I tried to correct the keystone of the image
I tried using the Datetime module by Python
I tried to find the optimal path of the dreamland by (quantum) annealing
I tried to extract and illustrate the stage of the story using COTOHA
I tried to summarize the contents of each package saved by Python pip in one line
I want to analyze the emotions of people who want to meet and tremble
I tried to analyze the negativeness of Nono Morikubo. [Compare with Posipa]
Qiita Job I tried to analyze the job offer
I tried to streamline the standard role of new employees with Python
[Linux] I tried to verify the secure confirmation method of FQDN (CentOS7)
I tried to get the movie information of TMDb API with Python
I tried to predict the price of ETF
I tried to vectorize the lyrics of Hinatazaka46!
I tried to open the latest data of the Excel file managed by date in the folder with Python
A super introduction to Django by Python beginners! Part 2 I tried using the convenient functions of the template
I tried to deliver mail from Node.js and Python using the mail delivery service (SendGrid) of IBM Cloud!
I tried to notify the update of "Hamelin" using "Beautiful Soup" and "IFTTT"
[Python] I tried to judge the member image of the idol group using Keras
I tried to easily visualize the tweets of JAWS DAYS 2017 with Python + ELK
I tried to automate the 100 yen deposit of Rakuten horse racing (python / selenium)
I tried to predict the presence or absence of snow by machine learning.
I tried to rescue the data of the laptop by booting it on Ubuntu
I tried to refactor the code of Python beginner (junior high school student)
I tried to pass the G test and E qualification by training from 50
I tried to automatically send the literature of the new coronavirus to LINE with Python
python beginners tried to predict the number of criminals
I tried to graph the packages installed in Python
I tried to summarize how to use matplotlib of python
I tried to summarize the basic form of GPLVM
I tried to touch the CSV file with Python
I tried to solve the soma cube with python
I checked out the versions of Blender and Python
[Python] I tried to graph the top 10 eyeshadow rankings
I tried to erase the negative part of Meros
I tried to solve the problem with Python Vol.1
I tried to analyze J League data with Python
[Python] I tried to get Json of squid ring 2
I tried to classify the voices of voice actors
[Python & SQLite] I tried to analyze the expected value of a race with horses in the 1x win range ①