[Python] All the destructive methods that data scientists should know

While reviewing a fellow data scientist's code, I said, "It's dangerous to call a destructive method on a function argument," but I couldn't explain it properly for lack of vocabulary, so I wrote this article.

To be fair, R has (almost?) no destructive operations, so I suspect you can't pick up this concept just by casually using numpy or pandas. I looked for good documentation and articles, but I could only find Ruby ones, which seemed awkward to explain with, so I wrote my own.

What is a destructive method?

Here is an example of destructive and non-destructive versions of the operation "append elements to the end of a list".

# Destructive example
x = [1, 2, 3]
x.extend([4, 5, 6])
print(x)
# => [1, 2, 3, 4, 5, 6]

# Non-destructive example
x = [1, 2, 3]
y = x + [4, 5, 6]
print(y)
# => [1, 2, 3, 4, 5, 6]

print(x)
# => [1, 2, 3]
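
One way to see the difference is to check the object's identity with id(); this is just a small sketch to illustrate the point:

x = [1, 2, 3]
print(id(x))          # some object id
x.extend([4, 5, 6])   # destructive: the same object is modified in place
print(id(x))          # same id as before

x = x + [7, 8, 9]     # non-destructive: a brand-new list is created and bound to x
print(id(x))          # a different id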

In Python, list.sort() and list.reverse() are provided as destructive list methods, while sorted() and reversed() are provided as their non-destructive counterparts for sorting and reversing. See the official documentation for details.
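
For example, the two sorting variants behave like this (a quick comparison with a throwaway list):

x = [3, 1, 2]
y = sorted(x)   # non-destructive: returns a new sorted list
print(y)        # => [1, 2, 3]
print(x)        # => [3, 1, 2]

x.sort()        # destructive: sorts x in place and returns None
print(x)        # => [1, 2, 3]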

In this article I use the term "destructive method" for convenience, but the destructive operation may also be a plain function, although that is rare in Python's standard library. For example, random.shuffle, which shuffles a list in place.
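
A quick sketch of random.shuffle (the shuffled order is of course random):

import random

x = [1, 2, 3, 4, 5]
result = random.shuffle(x)  # shuffles x in place
print(result)               # => None (a common pitfall: the shuffled list is not returned)
print(x)                    # => e.g. [4, 1, 5, 2, 3]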

Things to watch out for when using destructive methods

The problem is that if you perform a destructive operation on a function argument, the caller's object itself gets changed, and the code becomes hard to follow.

Consider the following "function that processes data and assembles JSON".

import json
from typing import List

def convert_to_json(values: List[int]) -> str:
    """Function to save the result in json format
    """
    #In the process of processing the data`append`Will perform destructive processing
    new_value = 3
    values.append(new_value)
    result = {"values": values, "size": len(values)}
    return json.dumps(result)

x = [1, 2, 3]
log = convert_to_json(x)

# The calculation result was obtained!
print(log)
# => {"values": [1, 2, 3, 3], "size": 4}

# ...but x itself was changed inside convert_to_json!
print(x)
# => [1, 2, 3, 3]

Did you notice that the variable x has been updated? If this happens unintentionally, you will end up debugging in tears after realizing, "Wait, a value got added to the list **somewhere** in the program?"

To avoid doing this unintentionally, you can use a non-destructive operation instead, for example like this:

def convert_to_json(values: List[int]) -> str:
    # Note: doing `values += [3]` on the list would be a destructive operation, so be careful!
    # https://stackoverflow.com/questions/2347265/why-does-behave-unexpectedly-on-lists
    # It trips up newcomers, but it was apparently implemented this way with execution efficiency in mind
    values = values + [3]
    result = {"values": values, "size": len(values)}
    return json.dumps(result)
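
To see why the comment warns about `+=`, here is a minimal sketch (the variable names are just for illustration):

a = [1, 2, 3]
b = a
b += [4]       # in-place extension: mutates the single list both names refer to
print(a)       # => [1, 2, 3, 4]

a = [1, 2, 3]
b = a
b = b + [4]    # creates a brand-new list and rebinds b; a is untouched
print(a)       # => [1, 2, 3]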

Alternatively, you can copy the list at the beginning of the function, as shown below. There are some ominous warnings in the comments here and there; I won't drill into them, so please look them up when you actually get stuck lol

def convert_to_json(values: List[int]) -> str:
    # Note: the `copy` method does not copy nested lists and dictionaries!
    # https://qiita.com/ninomiyt/items/2923aa3ac9bc06e6a9db
    # For details, look up deep copy vs. shallow copy
    copied = values.copy()
    copied.append(3)
    result = {"values": copied, "size": len(copied)}
    return json.dumps(result)
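
Regarding the warning in that comment, here is a minimal illustration of the shallow vs. deep copy difference:

import copy

nested = [[1, 2], [3, 4]]
shallow = nested.copy()       # shallow copy: the inner lists are still shared
shallow[0].append(99)
print(nested)                 # => [[1, 2, 99], [3, 4]]  the original is affected

nested = [[1, 2], [3, 4]]
deep = copy.deepcopy(nested)  # deep copy: nested structures are copied too
deep[0].append(99)
print(nested)                 # => [[1, 2], [3, 4]]  the original is untouched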

Also, if you never use x after this point, you may not need to worry too much. However, be aware that doing so will cause the same problem when the function is reused somewhere else. The same goes for destructive operations on global variables (and any other variables outside the function).

When should destructive methods be used?

You might ask, "If it's that risky, why not just always use non-destructive operations?" But a destructive method does not create a new object, so it is more efficient. Python's official documentation comparing list.sort and sorted also says:

Slightly more efficient if you don't need the original list.

For example, "the process of manipulating a list with a tremendous number of values" may use up memory every time you make a copy. If you're a data scientist, you're likely to run into these memory efficiency issues when you're doing a lot of rows with pandas.DataFrame.

In such cases, use destructive methods where they make sense. However, in pandas it is hard to tell from a method's name whether it is destructive or non-destructive, and I have seen cases where huge data gets copied inside a method (.../items/f8c562e0938271695576), so it may take some getting used to.
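
As a rough sketch of what this looks like in pandas (the DataFrame here is a tiny placeholder, and note that even inplace=True does not necessarily avoid an internal copy):

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})

sorted_df = df.sort_values("a")       # non-destructive: returns a new DataFrame
df.sort_values("a", inplace=True)     # destructive: modifies df itself and returns None
df.drop(columns=["a"], inplace=True)  # many methods take an inplace flag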

As an aside, R has no destructive operations (a copy is made internally every time), so this kind of efficiency tuning is not possible there. Being able to choose between the two approaches depending on the case is, I think, one of the merits of using a general-purpose programming language like Python.

To put it a bit more abstractly, you need to weigh this trade-off against program efficiency while keeping "**localizing side effects**" in mind. I will just quote the article "Unit Testing in R" here because it had a good illustration.

[Figure: illustration of side effects, from "Unit Testing in R"]

Effects on anything other than a function's input (its arguments) and output (its return value) are called side effects: reading or writing data other than the arguments, such as global variables or external databases.

Using functions that have no side effects (pure functions) has advantages such as being easy to test and easy to reuse. For example, if the function is pure, a pytest test can be as simple as this:

def test_f():
    x = [1, 2, 3]
    y = f(x)
    assert y == [1, 2, 3, 4, 5, 6]

If the function f sends SQL to a database or reads input from the terminal, you cannot test it as-is (you need to write code that prepares a mock object), and if the argument x is destroyed unintentionally, that often slips past the checks.
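
If you also want the test to catch accidental destruction of the argument, one possible sketch (using the same hypothetical f as above) is:

def test_f_does_not_mutate_argument():
    x = [1, 2, 3]
    f(x)
    assert x == [1, 2, 3]  # guards against unintended destructive operations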

In other words, if the effects of destructive operations are contained entirely inside the function, I don't think you need to force the code to be non-destructive. For example, the following code performs the destructive operation `append`, but since the function `f` has no effect on the outside world (no side effects), there is no problem with testing or reuse.

def f(x):
    """Build an x-by-x identity matrix as a list of lists."""
    matrix = []
    for i in range(x):
        row = []
        for j in range(x):
            if i == j:
                row.append(1)
            else:
                row.append(0)
        matrix.append(row)
    return matrix
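
For example, calling it gives:

print(f(3))
# => [[1, 0, 0], [0, 1, 0], [0, 0, 1]]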

You also need to pay attention to side effects other than destructive operations. That would make this article too long, so I'll save it for another time, but if I had to give one piece of advice:

  1. Read in the required data (side effects)
  2. Take the data from step 1 and perform the complicated numerical computation (no side effects)
  3. Output the result of step 2 (side effects)

I think implementing your code in this form is a good idea, because it makes the numerical computation part, where data scientists should really be adding value, easier to test and reuse.
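
As a minimal sketch of this structure (the file names, column handling, and function names here are placeholders, not from any real project):

import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    # 1. Read the required data (side effect: file I/O)
    return pd.read_csv(path)

def compute_summary(df: pd.DataFrame) -> pd.DataFrame:
    # 2. Pure numerical computation: no mutation of df, no external access
    return df.describe()

def save_result(summary: pd.DataFrame, path: str) -> None:
    # 3. Output the result (side effect: file I/O)
    summary.to_csv(path)

if __name__ == "__main__":
    df = load_data("input.csv")
    summary = compute_summary(df)
    save_result(summary, "summary.csv")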

(As a complete aside, some languages force you to strictly separate pure functions from functions with side effects. Land of Lisp's cartoon about side effects (.../archives/51854832.html) is hilarious, so please go read it.)

Summary

That's all for the summary. Please leave a comment if I've made a mistake somewhere.
