The conclusion is exactly what the title says, and it is also stated in the official documentation.
A colleague told me, "The assign method in pandas copies the data frame internally, so it is slow and consumes memory, which makes it troublesome."
I had read "Recursion Substitution Eradication Committee for Python / pandas Data Processing", got hooked on it, and had been writing statistical batch processing neatly with method chains.
However, if you look at the actual pandas code:
# Comments and other methods are omitted
class DataFrame(NDFrame):
    def assign(self, **kwargs):
        data = self.copy()
        # do all calculations first...
        results = {}
        for k, v in kwargs.items():
            results[k] = com._apply_if_callable(v, data)
        # ... and then assign
        for k, v in sorted(results.items()):
            data[k] = v
        return data
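To make the effect of that leading self.copy() visible, here is a minimal sketch. The frame and column names are made up for illustration, and it only checks what can be observed from the outside; newer pandas releases may implement the copy differently from the version quoted above.

```python
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({'a': [1, 2, 3]})

# assign returns a new DataFrame instead of modifying df in place.
df2 = df.assign(b=lambda d: d['a'] * 2)

is_new_object = df2 is not df
original_columns = list(df.columns)   # the original keeps only its own columns
new_columns = list(df2.columns)       # the returned frame has the added column
```

The original frame is left untouched, which is what makes assign convenient in a method chain, and also why it has to start from a copy.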
At first I thought, "What? Isn't Python's copy method a shallow copy for dictionaries and lists?"
That is true for the built-in containers: calling copy on a dictionary or list creates a new container, but the objects inside are the same, so the copy does not eat up memory by duplicating the contents.
a = {'a': [1, 2, 3]}
b = a.copy()
# The values inside a and b are the same object
assert a['a'] is b['a']
# Destructive changes leak through to the copy!
a['a'].append(4)
print(b)
# => {'a': [1, 2, 3, 4]}
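The copy method of pandas, however, makes a deep copy by default (deep=True), so the data itself is duplicated. When dealing with huge data frames, it is therefore worth worrying about the memory used by copy (and by the assign method that calls it).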
import pandas as pd
df_a = pd.DataFrame({'a': [1, 2, 3]})
df_b = df_a.copy()
# The contents of df_a and df_b are not the same object!
assert df_a['a'] is not df_b['a']
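The check above compares Series wrapper objects, so it would pass even if the two frames shared their data. A slightly stronger sketch, assuming NumPy is available, compares the underlying buffers and confirms that a change to one frame does not reach the other:

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'a': [1, 2, 3]})
df_b = df_a.copy()  # deep=True is the default

# Compare the underlying buffers rather than the Series wrapper objects.
shares_buffer = np.shares_memory(df_a['a'].to_numpy(), df_b['a'].to_numpy())

# A change made to df_a does not show up in the deep copy df_b.
df_a.loc[0, 'a'] = 100
b_values = df_b['a'].tolist()
```

Because the buffers are separate, the copy costs roughly as much memory as the original data.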
Half jokingly, I told my colleague that in Haskell this would not be a problem: with no destructive updates, a shallow copy would be enough.
Below is a comment from my colleague. I would like to know whether there is a writing style that is both easy to understand and memory-efficient.
[A caution when eradicating recursive assignment in pandas] assign and pipe are often used to avoid recursive assignment, but be aware that assign copies df itself, so it is considerably slower. pipe, on the other hand, does not copy, so it is fine.
assign: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/frame.py#L2492
pipe: https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/generic.py#L2698-L2708
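A pipe-based chain of the kind mentioned in the comment might look like the sketch below. The helper add_double is hypothetical and exists only for this illustration. Whether the original df also gains the new columns depends on the pandas version and its copy settings, so the sketch only relies on the returned frame.

```python
import pandas as pd

def add_double(df, src, dest):
    """Hypothetical helper: add a column with plain assignment and return the frame."""
    df[dest] = df[src] * 2
    return df

df = pd.DataFrame({'a': [1, 2, 3]})

# Each step adds a column with ordinary assignment; no assign call is involved.
result = (
    df
    .pipe(add_double, 'a', 'b')
    .pipe(add_double, 'b', 'c')
)
```

The chain still reads top to bottom, while the column is added with ordinary assignment inside the helper.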
However, I do think assign is easy to read as an interface, so I suspect the best you can do is one of the following:
- Reduce the number of assign calls as much as possible (adding multiple columns in a single assign is fine, because the copy happens only once)
- Accept the trade-off and use recursive assignment
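The options can be compared side by side in a small sketch. The column names are illustrative, and both new columns depend only on x, because in the version quoted earlier all values are computed before any of them is assigned.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# Several assign calls: every call starts from its own copy of the frame.
chained = df.assign(y=lambda d: d['x'] + 1).assign(z=lambda d: d['x'] * 10)

# One assign call with several columns: a single copy up front.
single = df.assign(y=lambda d: d['x'] + 1, z=lambda d: d['x'] * 10)

# Plain assignment on a frame we own: no assign call at all.
inplace = df.copy()
inplace['y'] = inplace['x'] + 1
inplace['z'] = inplace['x'] * 10
```

All three produce the same result; they differ only in how many times the frame is copied along the way.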
By the way, I also tried narrowing down the columns before passing the frame to assign and then using concat to merge the result back into the original data frame, but that turned out to be considerably slower, so it is not a good option either.
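For reference, one possible reading of that narrow-then-concat approach is sketched below. The frame and column names are assumptions made for illustration, and the sketch does not measure speed; it only shows the shape of the code.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'big': ['a', 'b', 'c']})

# Pass only the needed column to assign, so the frame that assign copies is small.
narrow = df[['x']].assign(y=lambda d: d['x'] * 2)

# Join the newly created column back onto the original frame.
combined = pd.concat([df, narrow[['y']]], axis=1)
```

Note that concat itself builds a new frame, which may be part of why this route did not pay off.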