To speed up Python: a summary of the computational complexity of the collection types (list / tuple / dictionary / set), organized by use case.

This is the 6th day of the Python Part 2 Advent Calendar 2019. Yesterday's post was bluepost59's "Extreme comprehension notation", a thorough introduction to comprehension patterns that are useful for speeding up code.

This article is also aimed at speeding things up. To organize the performance of more basic operations, it summarizes the computational complexity of Python's collection types (list / tuple / dictionary / set) by use case.

There is already a detailed article that summarizes the computational complexity of each collection type.

On the other hand, there were few articles giving a bird's-eye view of the complexities of the various collection types organized by use case, so I would like to write something that can serve as a reference from that angle.

Note that this article is about the CPython implementation, so it may not apply to other implementations such as PyPy.

Computational complexity notation

In this article, time complexity is expressed in $O$ notation. Unless otherwise noted, we discuss the **average-case complexity**. Throughout, $n$ refers to the length of the target collection, and $k$ refers to the length of the collection being added or deleted.

symbol  complexity     overview
O(1)    constant time  processing speed does not depend on size
O(n)    linear time    processing speed is directly proportional to size

ref: Computational complexity order

Verification environment

List of uses

- Get length
- Value reference
- Iteration
- in operator
- Value assignment
- Add value
- Delete value

Get length

data structure  complexity  operation
list/tuple O(1) len(seq)
dictionary O(1) len(dic)
set O(1) len(sets)

The length of any of these data structures can be obtained at the same speed, regardless of size. The length is not computed on demand but stored on the object, so only a reference cost is incurred, which is $O(1)$.
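For example, the following quick check (a minimal sketch; the sizes are arbitrary) should show len() taking roughly the same time for both lists:

short = list(range(10))
long = list(range(10_000_000))
%timeit len(short)
%timeit len(long)
# both should report roughly the same time, since len() just reads the stored size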

Value reference

data structure  complexity  operation
list/tuple      O(1)        seq[i]
                O(k)        seq[i:j]
dictionary      O(1)        dic[key]
set             -           (no indexing; see below)

For list, tuple, and dictionary, a value can be referenced at the same speed regardless of size. A slice, however, costs time proportional to the length of the referenced range. Note that a set cannot be indexed at all; membership is tested with the in operator instead.
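As a quick illustration (my own sketch; the sizes are arbitrary), the cost of a slice grows with the number of elements it copies:

seq = list(range(1_000_000))
%timeit seq[0]          # O(1): a single reference, independent of size
%timeit seq[10:20]      # O(k): copies 10 elements
%timeit seq[:500_000]   # O(k): copies 500,000 elements, so it is much slower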

Iteration

data structure  complexity  operation
list/tuple O(n) for item in seq
dictionary O(n) for key in dic.keys()
O(n) for item in dic.values()
O(n) for key, item in dic.items()
set O(n) for item in sets

All of these data structures can be iterated at about the same speed as each other; the cost is linear in the size.

In terms of asymptotic complexity, iterating over indices and referencing each value inside the loop is equivalent to the forms in the table, but in practice there is a considerable speed difference, so it is better to iterate directly as shown in the table.

from typing import List

def iter_index(seq:List[int], index:List[int]):
    """Iterate index and refer to it in loop"""
    for idx in index:
        seq[idx]

seq = list(range(10000))
index = list(range(len(seq)))
%timeit iter_index(seq, index)
# >>> 281 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

def iter_list(seq:List[int]):
    """Iterates the collection type directly"""
    for item in seq:
        item
        
%timeit iter_list(seq)
# >>> 119 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# about 2 to 3 times faster

in operator

data structure  complexity  operation
list/tuple O(n) item in seq
dictionary O(1) key in dic.keys()
O(n) item in dic.values()
O(1) (key, item) in dic.items()
set O(1) item in sets

As you would intuitively expect, list/tuple is $O(n)$, and dic.values() is also $O(n)$. It is worth remembering that everything else is $O(1)$: dictionary keys and sets are implemented with hash tables. See the following article for details. ref: How things got explosively faster just by changing "in list" to "in set" in Python, and why
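A quick sketch of the difference (my own example; the sizes are arbitrary): a key lookup uses the hash table, while a search over the values has to scan linearly.

dic = {str(i): i for i in range(100000)}
%timeit "50000" in dic          # O(1): hash lookup on the keys
%timeit 50000 in dic.values()   # O(n): linear scan over the values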

Cases where "in set" is slower

The in operator is faster with a set, but in practice lists are used far more often, so there are probably many cases where a list is converted to a set just before being passed to the in operator. Considering the cost of that conversion, some caution is required. (Below I examine only list, but the same argument holds for tuple.) Measuring "in list", "in set", and "convert list to set & in set" gave the following results.

# "in list"
lists = list(range(100000))
%timeit 50000 in lists
# >>> 622 µs ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# "in set"
sets = set(lists)
%timeit 50000 in sets
# >>> 54.8 ns ± 4.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

#"Conversion from list to set& in set」
%timeit 50000 in set(lists)
# >>> 2.94 ms ± 61.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Measuring the same processing at different sizes gives:

size     "in set" / "in list" ratio     "in set(list)" / "in list" ratio
10 3 4
100 19 3
1000 112 3
10000 1118 3
100000 11350 5

In other words:

- "in set" is faster than "in list", and the gap grows linearly with size.
- However, "convert list to set & in set" is **about 3 to 5 times slower than "in list"**.

Therefore, in the following case, using "in set" may **actually slow things down** (see the sketch after this list):

- a list-to-set conversion is required, and
- the converted set is used with the in operator fewer than about 5 times
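As a rough sketch of that break-even (my own illustration; the exact threshold depends on the environment):

lists = list(range(100000))
targets = [1, 2, 3]  # only a few lookups

# Option A: convert once (O(n)), then each lookup is O(1).
# The conversion cost dominates when there are only a few lookups.
sets = set(lists)
hits_a = [t in sets for t in targets]

# Option B: no conversion, but each lookup is O(n).
# With this few lookups, this can still be the faster choice.
hits_b = [t in lists for t in targets]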

Value assignment

data structure  complexity  operation
list            O(1)        seq[i] = item
                O(n+k)      seq[i:j] = lists
dictionary      O(1)        dic[key] = item

Assignment here means an operation that changes a value without changing the length. A set element cannot be reassigned; it has to be deleted and a new one added.
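Since a set element cannot be assigned, a "replacement" has to be written as a deletion followed by an addition, for example:

sets = {1, 2, 3}
sets.discard(2)  # delete the old value (no error even if it is absent)
sets.add(5)      # add the new value
# sets is now {1, 3, 5}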

Add value

data structure  complexity  operation            remarks
list            O(1)        seq.append(item)     add a value at the end
                O(n)        seq.insert(i, item)  insert a value at position i
                O(k)        seq.extend(lists)    append a list of length k at the end
dictionary      O(1)        dic[key] = item      add an element to the dictionary
                O(k)        dic.update(dic2)     merge in a dictionary of length k
set             O(1)        sets.add(item)       add an element to the set

For a list, the cost depends on where the value is added, so algorithms should append at the end whenever possible. Also, all of the additions above are in-place (the collection as it was before the addition is not preserved), so they can have unexpected side effects if you are not careful about the scope of the variable, as the example below shows.
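For example (my own illustration), because append mutates the list in place, every name bound to the same list sees the change:

a = [1, 2, 3]
b = a            # b is another name for the same list, not a copy
a.append(4)
print(b)         # [1, 2, 3, 4] -- b changed too
c = a.copy()     # take an explicit copy if the original must be preserved
a.append(5)
print(c)         # [1, 2, 3, 4] -- the copy is unaffected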

Sequential addition or join

When the values to be added are first filtered by a condition, there are two options:

- add each value sequentially while filtering
- build a collection of the values to add with the filter, then join it in one go

Intuitively, adding sequentially seems less wasteful, but in Python a plain for loop is slow and comprehensions are preferable where possible, so it is a subtle trade-off. The results of a trial are shown below. There does not seem to be much difference in speed, so use whichever is convenient. Personally, I prefer the latter, because a comprehension has (or should have) fewer side effects, since fewer assignments and definitions are needed.

list

If the iterable of values to add is long, it seems faster to build a list with a comprehension first and then extend. Note 1: a destructive addition would change the size between runs and make the measurement inaccurate, so a copy of the list is passed to the function. Note 2: the attribute lookup of append is known to be slow, so it is assigned to a variable before being used in the loop (reference: Python list comprehension speed).

from copy import copy

def addition_extend(base:List[int], add:List[int]):
    add = [f for f in add if f]
    base.extend(add)
    return base

def addition_append(base:List[int], add:List[int]):
    base_a = base.append
    for ad in add:
        if ad:
            base_a(ad)
    return base

base = list(range(10000))
add = list(range(10))
%timeit addition_extend(copy(base), add)
# >>> 43.9 µs ± 6.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit addition_append(copy(base), add)
# >>> 39 µs ± 66.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

base = list(range(10))
add = list(range(10000))
%timeit addition_extend(copy(base), add)
# >>> 373 µs ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit addition_append(copy(base), add)
# >>> 540 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

dictionary

I tried the same with a dictionary. For dictionaries, sequential assignment seems faster; however, the variance is large, so in practice there does not seem to be much difference.

from typing import Dict

def addition_update(base:Dict[str, int], add:Dict[str, int]):
    add = {f: g for f, g in add.items() if g}
    base.update(add)
    return base

def addition_assign(base:Dict[str, int], add:Dict[str, int]):
    for ad, val in add.items():
        if val:
            base[ad] = val
    return base

base = {str(f): f for f in range(10000)}
add = {str(f): f for f in range(10)}
%timeit addition_update(copy(base), add)
# >>> 365 µs ± 62.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit addition_assign(copy(base), add)
# >>> 312 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


base = {str(f): f for f in range(10)}
add = {str(f): f for f in range(10000)}
%timeit addition_update(copy(base), add)
# >>> 1.16 ms ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit addition_assign(copy(base), add)
# >>> 919 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Delete value

data structure  complexity  operation           remarks
list            O(1)        seq.pop()           delete the value at the end
                O(n)        del seq[i]          delete the i-th value
dictionary      O(1)        dic.pop(key)        delete by key
set             O(1)        sets.discard(item)  delete by value

For a list, deletions should be limited to the end whenever possible. The question is what to do when there are multiple values to delete: intuitively, the more values you delete, the more attractive it becomes to simply rebuild the list. I investigate the following two comparisons. The conclusions first:

- pop vs slice: **slice** looks better
- delete vs recreate: **recreate** looks better

pop vs slice

Popping k times from a list of size n costs $O(k)$, while keeping the remainder with a slice costs $O(n-k)$, so in theory pop should win when k is small relative to n and slice should win when k is close to n. To see what happens in practice, I measured the speed with the following code.

def pop(seq:List[int], num:int):
    """Delete num values by popping from the end one at a time"""
    for i in range(num):
        seq.pop()

def slices(seq:List[int], num:int):
    """Keep everything except the last num values with a single slice"""
    seq[:-num]

Below is a table of the ratio of pop's execution time to slice's, varying the list length and the number of deletions. A ratio greater than 1 means slice was faster. Basically, slice seems to be the better choice.

pop / slice ratio    deletions: 1    10     100    1000
list size 10         1.2   3.0
list size 100        1.0   1.6   10.9
list size 1000       0.6   0.7   2.1    32.5
list size 10000      0.6   0.5   0.5    1.7
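For reference, a minimal harness along these lines (my own sketch, not the measurement code actually used for the table above) can produce such a ratio table, reusing the pop and slices functions defined above. Since pop() consumes the list, a fresh copy is built for every call, which adds the same copy cost to both timings:

import timeit

for size in (10, 100, 1000, 10000):
    base = list(range(size))
    for num in (1, 10, 100, 1000):
        if num > size:
            continue  # cannot pop more values than the list holds
        t_pop = timeit.timeit(lambda: pop(list(base), num), number=1000)
        t_slice = timeit.timeit(lambda: slices(list(base), num), number=1000)
        print(f"size={size} deletions={num} pop/slice ratio={t_pop / t_slice:.1f}")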

delete vs recreate

Removing k values one by one from a list of size n costs $O(kn)$, while recreating the list costs $O(n)$. Recreating obviously looks faster, but I measured the speed with the following code.

from typing import Set

def delete(lists:List[int], cond:Set[int]):
    """Remove each value in cond from the list one by one"""
    for con in cond:
        try:
            lists.remove(con)
        except ValueError:
            pass
    return lists

def recreate(lists:List[int], cond:Set[int]):
    """Rebuild the list, skipping the values in cond"""
    return [f for f in lists if f not in cond]
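For example, the two can be compared like this (the sizes are arbitrary; a fresh copy is passed to delete because it mutates its argument):

lists = list(range(10000))
cond = set(range(100))
%timeit delete(list(lists), cond)
%timeit recreate(lists, cond)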

Below is a table of the ratio of delete's execution time to recreate's, varying the list length and the number of deletions. A ratio greater than 1 means recreate was faster. Basically, recreate seems to be the better choice.

delete / recreate ratio    deletions: 1    10     100     1000
list size 10               0.7   1.5
list size 100              0.3   1.3   3.5
list size 1000             0.2   1.2   12.8    6.0
list size 10000            0.3   1.8   114.7   117.4

Summary

This article summarized the computational complexity of the collection types by use case. Python is said to be a slow language, but I hope this helps you bring processing time down to a level you can tolerate!

Tomorrow's post is by typecprint!
