As much as I could find out about the GIL: what you should know when doing parallel processing in Python

I recently had the opportunity to write an orchestration-layer (BFF) application in Python.

Asyncio was introduced in Python 3.4, so I/O-bound processing can be handled efficiently even in a single thread. For CPU-bound processing, however, the GIL remains, so parallel execution within a single process is restricted.

This suggests that, as a language characteristic, Python is better suited to handling many I/O-bound tasks than CPU-bound ones. That is an important factor when choosing a language, and I felt I needed to properly understand the mechanism of the GIL for that purpose, so I investigated it.

What is GIL (Global Interpreter Lock)?

What is GIL in the first place?

Formally called the Global Interpreter Lock, the GIL is an exclusive locking mechanism found in languages such as Python and Ruby. Looking at those two languages alone, it may seem like a characteristic of dynamically typed languages, but it is actually tied to their integration with C.

As for Python, CPython, implemented in C, is the most widely used implementation.

Before explaining the GIL itself, note that simply saying "Python has a GIL" invites misunderstanding, so let's be a bit more precise.

To begin with, the language Python has multiple implementations. The most widely used is CPython, implemented in C, and it is usually what is implicitly meant when people describe "Python's" language characteristics.

Other typical examples are Jython, implemented in Java, and IronPython, which runs on the .NET Framework; these do not have a GIL. Hearing that alone, you may wonder why CPython is the one in common use. Major libraries such as NumPy are often implemented in C, and CPython is also the most actively updated implementation, which is presumably why it dominates.

With that in mind, the following explanation of the GIL is based on the CPython implementation.

The GIL as an exclusive lock

Now, returning to the main subject: roughly speaking, the GIL means that **even under multiple threads, bytecode can be executed only by the single thread that holds the lock, while the other threads wait**. The lock is released at regular intervals, and whichever thread newly acquires it executes next.

The locking mechanism is described later; for now, the key point is that CPU-bound processing can only be executed by one thread at a time, so parallelization is limited.

Let's actually see the effect of the GIL in code. First, run a simple program that counts down from a large number.

countdown.py


from datetime import datetime
from threading import Thread

def countdown():
    n = 10000000
    while n > 0:
        n -= 1

if __name__ == '__main__':
    start = datetime.now()

    t1 = Thread(target=countdown)
    t2 = Thread(target=countdown)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    end = datetime.now()
    print(f"Time: {end - start}")

This program counts down from 10 million in each of 2 threads; in my environment it took about 1.1 seconds. So what about running it in one thread?

if __name__ == '__main__':
    start = datetime.now()
    countdown()
    end = datetime.now()
    print(f"Time: {end - start}")

This finished in about 0.53 seconds. The fact that a single run takes half the time of the two-thread version means the threads were not running in parallel. This is because CPU-bound processing can only be executed by one thread at a time.

But what about processing that is not CPU-bound? Replace the countdown with a sleep and run it in 2 threads.

sleep.py


import time

def sleep():
    time.sleep(2.0)

This time, the program completes in about 2 seconds. If the sleeps behaved like CPU-bound work, 2 × 2 seconds would mean 4 seconds, but it took only half that. This is because the lock is released when sleep is called, so the waiting thread could start its own sleep immediately afterwards, and the two waits overlapped, running effectively in parallel.
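A runnable version of this experiment, scaled down to shorter sleeps than the article's 2-second example so it finishes quickly (a sketch; `sleep_task` is an illustrative name):

```python
import time
from datetime import datetime
from threading import Thread

def sleep_task():
    # The GIL is released while time.sleep() blocks,
    # so the other thread can run during the wait.
    time.sleep(0.5)

if __name__ == '__main__':
    start = datetime.now()
    t1 = Thread(target=sleep_task)
    t2 = Thread(target=sleep_task)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    elapsed = (datetime.now() - start).total_seconds()
    # Two 0.5 s sleeps overlap: total is ~0.5 s, not ~1.0 s.
    print(f"Time: {elapsed:.2f}s")
```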

Note that the lock is needed for executing Python bytecode, not necessarily for using the CPU.
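You can inspect that bytecode with the standard `dis` module; these are the individual instructions that the interpreter executes while holding the GIL:

```python
import dis

def countdown():
    n = 10000000
    while n > 0:
        n -= 1

# Disassemble the function into the bytecode instructions
# the interpreter executes one at a time under the GIL.
dis.dis(countdown)

# The instruction names are also available programmatically.
opnames = [ins.opname for ins in dis.get_instructions(countdown)]
print(opnames)
```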

Why the GIL exists

So why does the GIL, which limits parallel processing, exist in the first place? This is not a conclusion I derived by reading the CPython code myself, but the following seem to be the main factors.

- To simplify low-level mechanisms such as memory management and C integration.
- CPython makes it easy to work with libraries implemented in C, but those libraries are usually not thread-safe.

For these reasons, CPython requires that only one thread at a time can operate on bytecode, and the GIL is the mechanism that guarantees this.

However, this is not a characteristic of the Python language itself; it comes from CPython being implemented in C. For example, Jython, implemented in Java, has no GIL, because thread management by the JVM prevents such conflicts even under multithreading.

Even so, CPython remains dominant, presumably because the ability to leverage C-language assets and its active development are judged to outweigh the benefit of avoiding the GIL.

Releasing and acquiring the GIL

(This section is based on Understanding the Python GIL.)

The CPython GIL mechanism changed in version 3.2; switching now starts from a lock-release request called **gil_drop_request**.

For example, if there is only one thread, it simply keeps executing until it finishes, because no release request ever arrives.

With multiple threads, it is different. A suspended thread waits 5 ms by default and then sets gil_drop_request to 1. The running thread then releases the lock and signals that it has done so.

(Figure: quoted from Understanding the Python GIL)

When a thread waiting for the lock receives the signal, it acquires the lock and in turn sends a signal indicating that it has done so. On receiving that signal, the thread that just released the lock enters the suspended state.

(Figure: quoted from Understanding the Python GIL. All images in this section are from that material.)

After the next timeout, the threads repeat the same cycle: a waiting thread sets gil_drop_request, and the lock is released and acquired again.
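This periodic hand-off can be observed indirectly: if two pure-Python CPU-bound threads run together, both make progress, because the running thread is repeatedly forced to drop the GIL. This is a toy observation of the effect, not the internal mechanism itself:

```python
import time
from threading import Thread

counts = {"a": 0, "b": 0}
running = True

def spin(key):
    # A pure-Python busy loop holds the GIL while executing
    # bytecode, but is periodically forced to release it.
    while running:
        counts[key] += 1

t1 = Thread(target=spin, args=("a",))
t2 = Thread(target=spin, args=("b",))
t1.start()
t2.start()
time.sleep(0.2)   # let both threads compete for the GIL
running = False   # a plain flag is fine for a demo like this
t1.join()
t2.join()

# Both counters advanced: the two threads alternated even
# though neither of them ever blocked on I/O.
print(counts)
```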

The timeout interval can be changed

Threads waiting for the lock wait 5 ms by default. This interval can be read from Python code with sys.getswitchinterval() and changed with sys.setswitchinterval(interval); the value is expressed in seconds, so the default is 0.005.
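A quick check of the interval (changing it is rarely needed in practice):

```python
import sys

# Read the current GIL switch interval, in seconds.
print(sys.getswitchinterval())   # default: 0.005 (= 5 ms)

# A larger interval means fewer forced switches,
# at the cost of responsiveness between threads.
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())

# Restore the default.
sys.setswitchinterval(0.005)
```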

Why did it change to a method of sending a lock-release request?

Since Python 3.2, the lock is released via gil_drop_request, but before that it was released per a unit of execution called a tick.

Incidentally, the old tick interval could be referenced with sys.getcheckinterval(), but since the switching method changed it is no longer used, and the following warning is displayed.

DeprecationWarning: sys.getcheckinterval() and sys.setcheckinterval() are deprecated.  Use sys.getswitchinterval() instead.

So why did the lock release method change?

As mentioned, a waiting thread now sends a release request; previously, the running thread released the lock after every 100 ticks of execution by default. This caused problems in multi-core situations.

Let's start with the single-core case. When the running thread releases the lock, it signals one of the waiting threads. The signaled thread is placed in the run queue, and the OS scheduler decides, based on priority, whether the thread that just released the lock or the signaled thread runs next.

(The same thread may acquire the lock repeatedly, which can actually be desirable given the overhead of context switching.)

In the multi-core case, however, more than one runnable thread may try to acquire the lock at the same time, and one of them will fail. These futile acquisition attempts are pure overhead, and worse, waiting threads may hardly ever get the lock.

Waiting threads have a lag before resuming, so by the time one tries to acquire the lock, the thread that just released it has often re-acquired it already. In a long-running process, a single thread can reportedly keep holding the lock for minutes at a time.

In addition, when I/O operations that complete immediately thanks to OS buffering occur frequently, the lock is released and re-acquired on every I/O wait, which also increases the load.

Given these issues, the current method, in which waiting threads send a release request, is an improvement.

Disadvantages of the current GIL

So is the current GIL free of problems? It is not. Understanding the Python GIL introduces two disadvantages.

1. Unfair lock acquisition may occur

First, with three or more threads, the thread that requested the lock release may fail to acquire the lock, which is instead taken by a thread that arrived later.

(Figure: quoted from Understanding the Python GIL)

In the image above, thread 2 requests the release after its timeout, and thread 1 releases the lock and signals accordingly. Thread 2 should then acquire the lock, but in the meantime thread 3, which joined the queue later, acquires it first.

Depending on the timing, lock acquisition can thus be biased toward particular threads, making parallel processing inefficient.

2. Inefficiency can arise from the convoy effect

Also, when a CPU-bound thread and an I/O-bound thread run at the same time, an inefficient state called the convoy effect can occur.

From the viewpoint of the process as a whole, it is efficient to let the I/O-bound thread hold the lock with priority, hand it to the CPU-bound thread while waiting for I/O, and give it back with priority when the I/O completes. Conversely, if only the CPU-bound thread holds the lock, I/O-bound work is left pending, and total execution time is extended by the amount of I/O waiting.

However, threads have no priority, so there is no control over which thread acquires the lock first. If two threads are waiting, the CPU-bound thread may well get the lock instead.

Moreover, even if the I/O completes immediately, the I/O-bound thread must wait until the timeout before it can request the lock back. When many I/O waits occur, the CPU-bound processing proceeds during these successive timeouts, and the I/O-bound work is repeatedly left waiting.

This is called the convoy effect: because the lock is only handed over after a timeout, the result can be inefficient from the perspective of overall throughput.

Performing parallel processing with multiple processes

As many of you know, CPU-bound processing can be executed in parallel by using multiple processes. This is because each process holds its own interpreter, and the GIL exists per interpreter.

Let's run the part that was previously executed with multiple threads using multiple processes instead.

countdown.py


from datetime import datetime
from multiprocessing import Process

def countdown():
    n = 10000000
    while n > 0:
        n -= 1

if __name__ == '__main__':
    start = datetime.now()

    t1 = Process(target=countdown)
    t2 = Process(target=countdown)
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    end = datetime.now()
    print(f"Time: {end - start}")

The processing that took about 1.1 seconds with multiple threads now takes about 0.65 seconds with multiple processes. You can see that even CPU-bound processing can be executed in parallel.

Although processes have a larger overhead than threads, values can still be shared between them, and multiprocessing is useful when executing CPU-bound processing in parallel.
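As a sketch of sharing values between processes, `multiprocessing` provides shared-memory objects such as `Value`; the function and variable names here are illustrative, not from the article:

```python
from multiprocessing import Process, Value

def countdown_shared(result):
    # Each process has its own interpreter and its own GIL,
    # so these loops run truly in parallel with each other.
    n = 1_000_000
    while n > 0:
        n -= 1
    with result.get_lock():      # shared values still need a lock
        result.value += 1

if __name__ == '__main__':
    done = Value('i', 0)         # an int living in shared memory
    procs = [Process(target=countdown_shared, args=(done,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(done.value)            # both processes updated the shared int
```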

Sub-interpreters introduced in Python 3.8

At the time of writing, sub-interpreters are tentatively implemented in the just-released Python 3.8. Sub-interpreters are proposed in PEP 554, which has not yet been accepted.

As mentioned earlier, the GIL exists per interpreter, and sub-interpreters allow you to have multiple interpreters in the same process.

It is an idea with future potential, but since CPython keeps state at the runtime level, moving that state into each interpreter apparently still poses many problems.

You can actually try it by upgrading to Python 3.8 and importing _xxsubinterpreters, but it is probably still hard to use at a production level.

Leveraging the event loop with asyncio

This is a methodological aside from the GIL itself, but in current Python, when many I/O waits occur, it may be more practical to use the event loop provided by `asyncio`.

`asyncio` uses I/O multiplexing, which allows multiple I/O operations to be handled efficiently in a single thread, giving benefits close to those of multithreading.

In addition to using less memory than multithreading, there is no need to reason about lock acquisition and release across threads, and native coroutines with async/await can be written intuitively, reducing the cognitive load on programmers.
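A minimal sketch of this idea: two I/O waits handled concurrently on a single thread, with no locks involved (`fetch` is an illustrative name; `asyncio.sleep` stands in for a real I/O wait such as an HTTP call):

```python
import asyncio
import time

async def fetch(delay, label):
    # While one coroutine awaits its "I/O", the event loop
    # runs the other one on the same thread.
    await asyncio.sleep(delay)
    return label

async def main():
    # Run both "requests" concurrently on a single thread.
    return await asyncio.gather(fetch(0.5, "a"), fetch(0.5, "b"))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results)               # ['a', 'b']
print(f"{elapsed:.2f}s")     # ~0.5 s, not ~1.0 s: the waits overlapped
```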

Python coroutines will be covered in detail in a separate article.

Finally

The second half wandered a bit, but this article has covered the main topics related to the GIL.

These may not be points you need to be conscious of every day when writing application-level code, but I thought that knowing what restrictions apply when performing parallel processing would be useful for language selection, so I wrote this article.

Reference material

- Understanding the Python GIL (David Beazley)
