Effective Python Note Item 17: Be Defensive When Iterating Over Arguments

This is a memo on O'Reilly Japan's book Effective Python. https://www.oreilly.co.jp/books/9784873117560/ (pp. 38-42)

Note that iterators are stateful

When a function iterates over its argument more than once, passing an iterator as that argument can produce unexpected behavior.

**Consider a function that calculates what percentage of the total visits each city accounts for**

def normalize(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

visits = [15, 35, 80]
percentages = normalize(visits)
print(percentages)

>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]

This works fine, but suppose the amount of data becomes huge (large enough that building a list would exhaust memory), and try using a generator instead.

def read_visits(data_path):
    with open(data_path) as f:
        for line in f:
            yield int(line)

it = read_visits("visits.txt")  # assuming a file containing a large quantity of numbers, one per line
percentages = normalize(it)
print(percentages[:3])

>>>
[]

We expect a result similar to the code above, but in this case an empty list is returned. The reason is that an iterator produces its results only once.

Tracing the flow step by step:

  1. First, read_visits() creates it, which is a generator.
  2. normalize() takes the generator it as an argument, and sum() iterates over it first.
  3. **At this point, the generator it is already exhausted!**
  4. The for loop in normalize() then runs, but **it produces nothing because the generator has already run out of values**.
  5. result stays empty, unchanged from its initial declaration.

What is particularly confusing here is that no exception is raised when the iterator runs out. Python's iteration machinery cannot tell the difference between an iterator that had no output to begin with and one that is already exhausted and raising StopIteration, so the loop simply ends silently. The snippet below demonstrates this.
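As a minimal illustration (assuming the same visits.txt file used above), draining the generator once leaves nothing for a second pass, and the second pass fails silently:

it = read_visits("visits.txt")
print(list(it))  # the first pass consumes every value from the generator
print(list(it))  # the generator is now exhausted, so this silently prints []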

To solve this, copy the iterator's contents into a list so that the values can be iterated over as many times as needed.

def normalize_copy(numbers):
    numbers = list(numbers)  # copy the iterator's contents into a list here
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

it = read_visits("visits.txt")
percentages = normalize_copy(it)
print(percentages)

>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]

The result is as expected, but copying the contents into a list inside normalize_copy consumes a new block of memory, which throws away the benefit of using an iterator in the first place. Instead of building a list, consider accepting a function that returns a new iterator each time it is called.

Now define a function that takes a callable returning a new iterator

def normalize_func(get_iter):
    total = sum(get_iter())  # new iterator
    result = []
    for value in get_iter():  # another new iterator
        percent = 100 * value / total
        result.append(percent)
    return result

percentages = normalize_func(lambda: read_visits("visits.txt"))
print(list(percentages))

>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]

It works as expected, but having callers pass a lambda is clumsy. We can get the same result by implementing the **iterator protocol** ourselves.

The iterator protocol is how Python's for statements and related constructs traverse a container: iter(foo) calls foo.__iter__(), which must return an iterator object, and the loop then calls next() on that iterator until StopIteration is raised. Below is a rough sketch of that mechanism, followed by our own container class that implements the protocol.
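Roughly speaking, "for value in foo" behaves like the following (a minimal sketch; the names here are only for illustration):

foo = [15, 35, 80]
it = iter(foo)            # calls foo.__iter__() and gets back an iterator
while True:
    try:
        value = next(it)  # calls it.__next__()
    except StopIteration:
        break             # iteration ends when the iterator is exhausted
    print(value)          # the body of "for value in foo" would run here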

class ReadVisits(object):
    def __init__(self, data_path):
        self.data_path = data_path
        
    def __iter__(self):
        # each call to __iter__ returns a brand-new generator over the file
        with open(self.data_path) as f:
            for line in f:
                yield int(line)

visits = ReadVisits("visits.txt")
percentages = normalize(visits)
print(percentages)

>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]

The difference from the original read_visits is that the new ReadVisits container class allows the data to be iterated twice inside normalize: every call to __iter__ creates a brand-new iterator object, so visits can be traversed any number of times. (The drawback is that the input file is read from the beginning on every pass.)
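A quick way to see the difference (again assuming visits.txt exists): calling iter() on a container returns a fresh iterator every time, while calling iter() on an iterator simply returns the iterator itself.

visits = ReadVisits("visits.txt")
print(iter(visits) is iter(visits))  # False: two distinct iterators are created

it = read_visits("visits.txt")
print(iter(it) is iter(it))          # True: an iterator's __iter__ returns itself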

To guarantee correct processing, we can exploit this property and write a function that checks whether its argument is a container, rejecting plain iterators.

def normalize_defensive(numbers):
    if iter(numbers) is iter(numbers):  # True only when numbers is itself an iterator
        raise TypeError('Must supply a container')
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result

visits = [15, 35, 80]
normalize_defensive(visits)

>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]

No error

visits = ReadVisits("visits.txt")
normalize_defensive(visits)

>>>
[11.538461538461538, 26.923076923076923, 61.53846153846154]

No error

An error is raised if the input is an iterator rather than a container:

it = iter(visits)
normalize_defensive(it)

>>>
TypeError: Must supply a container
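As an aside, a similar guard can be written with the standard library's abstract base classes instead of the double iter() comparison. This is only an alternative sketch (the name normalize_defensive_abc is made up here), not the code from the book:

from collections.abc import Iterator

def normalize_defensive_abc(numbers):
    # reject a bare iterator: sum() below would silently exhaust it
    if isinstance(numbers, Iterator):
        raise TypeError('Must supply a container')
    total = sum(numbers)
    return [100 * value / total for value in numbers]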

Summary

- In iterative processing such as loops, passing an iterator as an argument can lead to unexpected behavior, because the iterator is exhausted after the first pass.
- A container class can be made safely iterable by implementing the iterator protocol (defining __iter__ as a generator).
- You can test whether an input is a plain iterator by calling iter() on it twice and checking whether the same object comes back.
