[PYTHON] EP 17 Be Defensive When Iterating Over Arguments

  • Beware of functions that iterate over input arguments multiple times. If these arguments are iterators, you may see strange behavior and missing values.

Effective Python

Say you want to analyze tourism numbers for the U.S. state of Texas. Imagine the data set is the number of visitors to each city (in millions per year). You'd like to figure out what percentage of overall tourism each city receives.

To do this you need a normalization function. It sums the inputs to determine the total number of tourists per year. Then it divides each city's individual visitor count by the total to find that city's contribution to the whole.

def normalize(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)

    return result

>>> visits = [15, 35, 80]
>>> percentage = normalize(visits)
>>> percentage
[11.538461538461538, 26.923076923076923, 61.53846153846154]

To scale this up, you need to read the data from a file that contains every city in Texas. A generator can do this.

def read_visits(data_path):
    with open(data_path) as f:
        for line in f:
            yield int(line)

Surprisingly, calling normalize on the iterator returned by read_visits produces []. The cause of this behavior is that an iterator produces its results only a single time. If you iterate over an iterator or generator that has already raised a StopIteration exception, you won't get any results the second time around.

>>> it = read_visits('data')
>>> percentage = normalize(it)
>>> percentage
[]

>>> it = read_visits('data')
>>> list(it)
[15, 35, 80]
>>> list(it)
[]
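
The same exhaustion can be reproduced without a data file. The sketch below is only an illustration: fake_visits is a hypothetical in-memory stand-in for read_visits, and the compact normalize is the same function as above. It shows that sum consumes the iterator, so the loop that follows has nothing left to visit.

```python
def fake_visits():
    # Hypothetical stand-in for read_visits: yields the sample values
    # from memory instead of reading a file.
    for value in [15, 35, 80]:
        yield value


def normalize(numbers):
    total = sum(numbers)      # First pass: exhausts an iterator argument
    result = []
    for value in numbers:     # Second pass: nothing left if it was an iterator
        result.append(100 * value / total)
    return result


from_list = normalize([15, 35, 80])   # A list can be traversed twice
from_iter = normalize(fake_visits())  # A generator's iterator cannot

print(from_list)
print(from_iter)
```

Note that no exception is raised in the second call. The function quietly returns an empty list, which is what makes this bug easy to miss.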

One solution is to copy the iterator's contents into a list before using them, but this may not be a good one. The copy of the input iterator's contents could be large. Copying the iterator could cause your program to run out of memory and crash.

def normalize(numbers):
    numbers = list(numbers) # Copy the iterator
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)

    return result

Another way around this is to accept a function that returns a new iterator each time it's called, although this is not ideal either.

def normalize_func(get_iter):
    total = sum(get_iter()) # New Iterator
    result = []
    for value in get_iter(): # New Iterator
        percent = 100 * value / total
        result.append(percent)

    return result

To use normalize_func, you can pass in a lambda expression that calls the generator and produces a new iterator each time.

percentage = normalize_func(lambda: read_visits(path))
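
One detail worth checking is that the lambda really has to build a new iterator on every call. The sketch below assumes a hypothetical in-memory generator, fake_visits, in place of read_visits(path), and contrasts a lambda that calls the generator function with one that returns an iterator created earlier.

```python
def fake_visits():
    # Hypothetical stand-in for read_visits(path), yielding from memory
    yield from [15, 35, 80]


def normalize_func(get_iter):
    total = sum(get_iter())       # New iterator expected here
    result = []
    for value in get_iter():      # New iterator expected here
        result.append(100 * value / total)
    return result


# The lambda calls the generator function on every invocation,
# so each pass gets its own iterator.
fresh = normalize_func(lambda: fake_visits())

# Pitfall: the lambda hands back the same iterator both times,
# and sum() has already consumed it before the loop starts.
it = fake_visits()
stale = normalize_func(lambda: it)

print(fresh)
print(stale)
```

The second call fails in the same silent way as the original normalize, so normalize_func only helps when the callable it receives truly produces a new iterator each time.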

Though it works, having to pass a lambda function like this is clumsy. The better way to achieve the same result is to provide a new container class that implements the iterator protocol.

Iterator protocol

The iterator protocol is how Python for loops and related expressions traverse the contents of a container type. When Python sees a statement like for x in foo, it actually calls iter(foo). The iter built-in function calls the foo.__iter__ special method in turn. The __iter__ method must return an iterator object (which itself implements the __next__ special method). Then the for loop repeatedly calls the next built-in function on the iterator object until it's exhausted (and raises a StopIteration exception).
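
To make these steps concrete, here is a rough sketch of what a for loop does, written out by hand. The function name manual_for is made up for this illustration, and the real interpreter does this work internally rather than through Python code like this.

```python
def manual_for(container):
    # Roughly equivalent to:
    #     for x in container:
    #         collected.append(x)
    collected = []
    iterator = iter(container)        # Asks the container for an iterator
    while True:
        try:
            x = next(iterator)        # Asks the iterator for the next item
        except StopIteration:
            break                     # Iterator exhausted: the loop ends
        collected.append(x)
    return collected


values = manual_for([15, 35, 80])
print(values)
```

A class only needs an __iter__ method that returns such an iterator to work with this loop, and writing __iter__ as a generator is one way to provide it.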

class ReadVisits:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        with open(self.data_path) as f:
            for line in f:
                yield int(line)
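
The following sketch shows the container class working with the original two-pass normalize function. The temporary file and its three sample values are assumptions made only so the example can run on its own.

```python
import os
import tempfile


class ReadVisits:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # Each call opens the file afresh and returns a new generator
        with open(self.data_path) as f:
            for line in f:
                yield int(line)


def normalize(numbers):
    total = sum(numbers)      # sum calls ReadVisits.__iter__ once
    result = []
    for value in numbers:     # the for loop calls it a second time
        result.append(100 * value / total)
    return result


# Hypothetical sample data written to a temporary file for the demo
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("15\n35\n80\n")
    path = tmp.name

try:
    visits = ReadVisits(path)
    percentages = normalize(visits)
    print(percentages)
finally:
    os.remove(path)
```

This works because each pass gets its own iterator. The trade-off is that the input file is read once per pass, so the data is read twice here.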

How to ensure that parameters aren't just iterators

The protocol states that when an iterator is passed to the iter built-in function, iter returns the iterator itself. In contrast, when a container type is passed to iter, a new iterator object is returned each time. Thus, you can test an input value for this behavior and raise a TypeError to reject iterators.

def normalize(numbers):
    if iter(numbers) is iter(numbers):
        raise TypeError("Must supply a container")
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)

    return result
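
As a quick check of this behavior, the sketch below calls the defensive normalize with a list, with a container class, and with a bare iterator. The Visits class is a hypothetical in-memory container used here so that no data file is needed.

```python
def normalize(numbers):
    if iter(numbers) is iter(numbers):  # An iterator returns itself
        raise TypeError("Must supply a container")
    total = sum(numbers)
    result = []
    for value in numbers:
        result.append(100 * value / total)
    return result


class Visits:
    # Hypothetical in-memory container: __iter__ returns a new
    # iterator on every call, like ReadVisits does for a file.
    def __init__(self, values):
        self.values = values

    def __iter__(self):
        return iter(self.values)


accepted_list = normalize([15, 35, 80])
accepted_container = normalize(Visits([15, 35, 80]))

try:
    normalize(iter([15, 35, 80]))
    rejected = False
except TypeError:
    rejected = True

print(accepted_list)
print(accepted_container)
print(rejected)
```

Both containers are accepted and give the same percentages, while the bare iterator is rejected up front instead of silently producing an empty result.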
