[PYTHON] Summary of pickle and unpickle processing of user-defined class

Overview

pickle is Python's own data serialization format and a very powerful mechanism, but for historical reasons the behavior behind it is neither as flexible nor as simple as one might wish. Here I summarize how classes other than the main builtin classes (minor builtin classes, standard / non-standard library classes, user-defined classes, etc.) are pickled and unpickled, and how to write user-defined classes that pickle efficiently.

The discussion here is based on Python 3.3.5 and Pickle Protocol Version 3. Protocol Version 4 was introduced from Python 3.4, but the internal processing has become more complicated, so I think it would be efficient to first understand it with Python 3.3 code.

Flow of pickle processing

You can understand most of it by following the methods and functions below.

Lib/pickle.py


class _Pickler:
    def save(self, obj, save_persistent_id=True):
        ...
    def save_reduce(self, func, args, state=None,
                    listitems=None, dictitems=None, obj=None):
        ...
    def save_global(self, obj, name=None, pack=struct.pack):
        ...

Objects/typeobject.c


static PyObject *
reduce_2(PyObject *obj)
{
    ...
}

static PyObject *
object_reduce_ex(PyObject *self, PyObject *args)
{
    ...
}

1st step of pickle

When pickle.dump, pickle.dumps, etc. are called, the object is always pickled by the following processing.

sample1.py


pickler = pickle.Pickler(fileobj, protocol)
pickler.dump(obj)

The Pickler class is either

  1. the C implementation _pickle.Pickler, or
  2. the Python implementation pickle._Pickler.

They are defined in the following places, respectively.

  1. static PyTypeObject Pickler_Type; defined in Modules/_pickle.c
  2. class _Pickler defined in Lib/pickle.py

Normally the C implementation is preferred, but if importing it fails, the Python implementation is used. Since the main purpose here is to understand the mechanism, we will focus on the Python implementation.

Individual objects are pickled recursively by pickler.save(obj). In the first half of this function, objects that have already been pickled (circular references, or objects referenced from several places) are written out as references to the earlier copy.
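The effect of this reference handling can be checked with a small sketch. The names shared and loop are made up for illustration; the point is only that identity relationships survive a round trip.

```python
import pickle

# One list referenced from two places in the same container.
shared = [1, 2]
data = {'a': shared, 'b': shared}

# A list that contains itself (circular reference).
loop = []
loop.append(loop)

restored = pickle.loads(pickle.dumps(data))
restored_loop = pickle.loads(pickle.dumps(loop))

# Both keys still point at one and the same restored list.
print(restored['a'] is restored['b'])
# The restored list still contains itself.
print(restored_loop[0] is restored_loop)
```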

For major builtin classes

Since the builtin classes and constants below are used so often, pickle implements its own efficient processing for them. They do not follow the explanation in this article, so they are omitted here.

int, float, str, bytes, list, tuple, dict, bool, None

Other classes are pickled by the procedure shown below.

For class objects or functions

When the pickle target is a class object (that is, isinstance(obj, type) == True) or a function, obj.__module__ and obj.__name__ are recorded as strings. When unpickling, the required module is imported and the value that this name refers to is returned. That is, only classes and functions defined in a module's global namespace can be pickled. Of course, the logic of the function or class is not stored; Python is not LISP.
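As a rough illustration of pickling by name, the sketch below uses a class and a function from the standard library (OrderedDict and json.dumps are chosen only as convenient examples). Unpickling gives back the identical objects, while a lambda, which has no name that can be looked up in its module, is rejected.

```python
import json
import pickle
from collections import OrderedDict

# A class and a function are stored by module and name only...
cls_payload = pickle.dumps(OrderedDict)
func_payload = pickle.dumps(json.dumps)
print(b'OrderedDict' in cls_payload)

# ...so unpickling hands back the very same objects, not copies.
print(pickle.loads(cls_payload) is OrderedDict)
print(pickle.loads(func_payload) is json.dumps)

# A lambda cannot be found again by name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)
```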

For objects of classes registered in the copyreg module

Next, the dictionary defined globally in the copyreg module is checked for an entry copyreg.dispatch_table[type(obj)].

sample02.py


import copyreg
if type(obj) in copyreg.dispatch_table:
    reduce = copyreg.dispatch_table[type(obj)]
    rv = reduce(obj)

The contents of the return value rv will be described later.
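For a concrete entry, the copyreg module itself registers a reduce function for complex, at least in the CPython versions I am aware of. The sketch below only looks that entry up and calls it by hand; it does not go through a pickler.

```python
import copyreg

# complex is registered in copyreg.dispatch_table by the copyreg module itself.
reducer = copyreg.dispatch_table[complex]

# The reduce function returns (callable, args) for rebuilding the value.
rv = reducer(1 + 2j)
print(rv)

# Calling func(*args) by hand gives back an equal complex number.
func, args = rv
print(func(*args))
```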

In this way, a function registered in copyreg.dispatch_table takes the highest priority when pickling. Therefore, the pickle / unpickle behavior can be changed even for a class whose definition you cannot modify. As an extreme case, you can make a datetime object turn into a regular expression object when it is pickled and unpickled.

sample03.py


import pickle
import copyreg
import datetime
import re

def reduce_datetime_to_regexp(x):
    return re.compile, (r'[spam]+',)

copyreg.pickle(datetime.datetime, reduce_datetime_to_regexp)

a = datetime.datetime.now()
b = pickle.loads(pickle.dumps(a))
print(a, b)  # Output like: 2014-10-05 10:24:12.177959 re.compile('[spam]+')

Entries are added to the dispatch_table dictionary via copyreg.pickle(type, func).

If there is a dictionary pickler.dispatch_table, this will be used instead of copyreg.dispatch_table. This is safer if you want to change the behavior only when pickling for a specific purpose.

sample03a.py


import pickle
import copyreg
import datetime
import re
import io

def reduce_datetime_to_regexp(x):
    return re.compile, (r'[spam]+',)

a = datetime.datetime.now()

with io.BytesIO() as fp:
    pickler = pickle.Pickler(fp)
    pickler.dispatch_table = copyreg.dispatch_table.copy()
    pickler.dispatch_table[datetime.datetime] = reduce_datetime_to_regexp
    pickler.dump(a)
    b = pickle.loads(fp.getvalue())

print(a, b)  # Output like: 2014-10-05 10:24:12.177959 re.compile('[spam]+')

When obj.__reduce_ex__ is defined

If the method obj.__reduce_ex__ is defined,

sample03.py


rv = obj.__reduce_ex__(protocol_version)

is called. The contents of the return value rv will be described later.

When obj.__reduce__ is defined

If the method obj.__reduce__ is defined,

sample03.py


rv = obj.__reduce__()

is called. The contents of the return value rv will be described later.

Need for __reduce__

At present there seems to be no need for it. You can always use __reduce_ex__ instead. It is looked up first, so it will be slightly faster. If you don't need the protocol argument, you can simply ignore it.
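The lookup order can be observed with a small sketch. The classes Both and OnlyReduce are hypothetical, and each reduce method deliberately returns a plain string via str so that the unpickled value shows which method was used.

```python
import pickle

class Both:
    """Hypothetical class that defines both hooks."""
    def __reduce_ex__(self, protocol):
        return str, ('from __reduce_ex__',)
    def __reduce__(self):
        return str, ('from __reduce__',)

class OnlyReduce:
    """Hypothetical class that defines only __reduce__."""
    def __reduce__(self):
        return str, ('from __reduce__',)

# The unpickled result is whatever str(...) produced from the chosen hook.
chosen_for_both = pickle.loads(pickle.dumps(Both()))
chosen_for_only_reduce = pickle.loads(pickle.dumps(OnlyReduce()))
print(chosen_for_both)
print(chosen_for_only_reduce)
```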

If you don't have any special definition

If no special method is written for pickle / unpickle, the standard reduce processing of object is performed as a last resort. This is, so to speak, "the most universal, greatest-common-divisor implementation of __reduce_ex__, usable as is for most objects". It is very instructive, but unfortunately it is implemented in C and is hard to follow. If error handling and similar details are omitted and the general flow is written in Python, it looks like this.

object_reduce_ex.py


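class object:
    def __reduce_ex__(self, proto):
        from copyreg import __newobj__

        # Arguments for __new__ when unpickling (the class itself is excluded)
        if hasattr(self, '__getnewargs__'):
            args = self.__getnewargs__()
        else:
            args = ()

        # State to restore after the object has been created
        if hasattr(self, '__getstate__'):
            state = self.__getstate__()
        elif hasattr(type(self), '__slots__'):
            # __dict__ may be missing, and unset slots are skipped
            state = (getattr(self, '__dict__', None),
                     {k: getattr(self, k) for k in type(self).__slots__
                      if hasattr(self, k)})
        else:
            state = self.__dict__

        # Elements of list subclasses are passed as an iterator
        if isinstance(self, list):
            listitems = iter(self)
        else:
            listitems = None

        # Items of dict subclasses are passed as an iterator of (key, value)
        if isinstance(self, dict):
            dictitems = iter(self.items())
        else:
            dictitems = None

        return __newobj__, (type(self),) + args, state, listitems, dictitems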

As you can see from the above, even if you rely on object.__reduce_ex__, you can fine-tune the behavior by defining the __getnewargs__ and __getstate__ methods. If you define __reduce_ex__ or __reduce__ yourself, these methods are not used unless you call them explicitly.

__getnewargs__: A method that returns a tuple of picklable values. Defining it lets you customize the arguments passed to __new__ (not __init__) when unpickling. The first argument (the class object) is not included.

__getstate__: Defining it lets you customize the argument passed to __setstate__ when unpickling or, when __setstate__ does not exist, the initial values of __dict__ and of the slots.
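To see how the two methods feed into the default processing, the following sketch inspects the tuple returned by the inherited __reduce_ex__. The Temperature class and its attributes are hypothetical; the five-element shape of the result is what I would expect from CPython's object.__reduce_ex__ at protocol 2 or later.

```python
import copyreg

class Temperature:
    """Hypothetical class: unit is set in __new__, the rest in __init__."""
    def __new__(cls, unit):
        self = super().__new__(cls)
        self.unit = unit
        return self

    def __init__(self, unit):
        self.readings = []
        self.cache = {}

    def __getnewargs__(self):
        # Arguments for __new__ on unpickle, without the class itself.
        return (self.unit,)

    def __getstate__(self):
        # Leave the cache out of the pickled state.
        state = self.__dict__.copy()
        del state['cache']
        return state

t = Temperature('C')
t.readings.append(21.5)

rv = t.__reduce_ex__(3)
func, args, state = rv[0], rv[1], rv[2]
print(func is copyreg.__newobj__)
print(args)
print(state)
```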

Values that the __reduce_ex__, __reduce__ and copyreg registration functions should return

In the above processing, the value rv that each function returns must be either a str or a tuple.

If type(rv) is str

type(obj).__module__ and rv are recorded as strings when pickling; when unpickling, the module is imported as needed and the object that this name refers to is returned. This mechanism is useful when pickling a singleton object and the like.
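A builtin that works this way on recent CPython versions (not necessarily on 3.3) is Ellipsis, whose __reduce__ returns just its global name. A minimal check:

```python
import pickle

# Ellipsis reduces to a plain string: the global name to look up again.
rv = Ellipsis.__reduce__()
print(rv)

# Unpickling looks the name up and returns the existing singleton.
restored = pickle.loads(pickle.dumps(Ellipsis))
print(restored is Ellipsis)
```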

If type(rv) is tuple

The tuple has between 2 and 5 elements, as follows.

  1. func -- A picklable, callable object (typically a class object) that creates the object when unpickling. However, func.__name__ == "__newobj__" is treated as an exception, described later.
  2. args -- A tuple of picklable elements. Used as the arguments when calling func.
  3. state -- An object used to restore the state of the object when unpickling. Optional. It may be None.
  4. listitems -- An iterable object that returns the elements of a list-like object. Optional. It may be None.
  5. dictitems -- An iterable object that returns the keys and values of a dict-like object. Each value returned by the iterator must be a key / value pair. Typically dict_object.items(). Optional. It may be None.
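The listitems and dictitems elements can be tried out with the sketch below. Tags and Fields are hypothetical classes that, in the same spirit as the datetime example above, unpickle as a plain list and an OrderedDict. Iterators are passed explicitly, because as far as I know the C pickler insists on iterators for these two elements.

```python
import pickle
from collections import OrderedDict

class Tags:
    """Hypothetical class that pickles itself as a plain list."""
    def __init__(self, *names):
        self.names = names

    def __reduce_ex__(self, protocol):
        # func, args, state, listitems: list() is called, then items appended.
        return list, (), None, iter(self.names)

class Fields:
    """Hypothetical class that pickles itself as an OrderedDict."""
    def __init__(self, **values):
        self.values = values

    def __reduce_ex__(self, protocol):
        # func, args, state, listitems, dictitems
        return OrderedDict, (), None, None, iter(self.values.items())

tags = pickle.loads(pickle.dumps(Tags('a', 'b', 'c')))
fields = pickle.loads(pickle.dumps(Fields(x=1, y=2)))
print(tags)
print(fields)
```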

When func.__name__ == "__newobj__"

In this case, args[0] is interpreted as a class object, and an instance of that class is created with args as the arguments to __new__. __init__ is not called at this time. If you need a func object that satisfies this condition, one is already defined in the copyreg module.

Lib/copyreg.py


def __newobj__(cls, *args):
    return cls.__new__(cls, *args)

copyreg.__newobj__ is implemented so that it behaves the same way even when it is called as an ordinary function, but it is not actually executed during unpickling.
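Calling it directly as an ordinary function shows the difference from normal construction. Noisy is a hypothetical class whose __init__ leaves a visible trace.

```python
import copyreg

class Noisy:
    """Hypothetical class whose __init__ leaves a visible trace."""
    def __init__(self):
        self.initialized = True

normal = Noisy()                   # __new__ followed by __init__
bare = copyreg.__newobj__(Noisy)   # __new__ only

print(hasattr(normal, 'initialized'))
print(hasattr(bare, 'initialized'))
```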

Interpretation of the value of state

It is interpreted as follows.

  1. If the object being unpickled has obj.__setstate__, state is passed as the argument to that method.
  2. If it is a two-element tuple, state[0] is a dictionary holding the contents of obj.__dict__ and state[1] is a dictionary holding the values of the attributes named in type(obj).__slots__. Either may be None.
  3. If it is a single dictionary, it holds the contents of obj.__dict__.
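Cases 2 and 3 can be observed without writing __setstate__ at all. In this sketch the hypothetical class AsNamespace unpickles as a types.SimpleNamespace, which has no __setstate__ of its own, so the state is applied directly to the new object.

```python
import pickle
from types import SimpleNamespace

class AsNamespace:
    """Hypothetical class that unpickles as a SimpleNamespace."""
    def __init__(self, state):
        self._state = state

    def __reduce_ex__(self, protocol):
        # func, args, state
        return SimpleNamespace, (), self._state

# Case 3: a single dictionary goes into __dict__.
single = pickle.loads(pickle.dumps(AsNamespace({'x': 1})))

# Case 2: a two-element tuple (dict for __dict__, dict applied with setattr).
pair = pickle.loads(pickle.dumps(AsNamespace(({'x': 1}, {'y': 2}))))

print(single)
print(pair)
```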

Unpickle process flow

You can understand most of it by following the methods below.

Lib/pickle.py


class _Unpickler:
    def load_newobj(self):
        ...
    def load_reduce(self):
        ...
    def load_build(self):
        ...
    def load_global(self):
        ...

1st step of unpickle

When pickle.load, pickle.loads, etc. are called, all are unpickled by the following processing.

sample1.py


unpickler = pickle.Unpickler(fileobj)
unpickler.load()

The Unpickler class is either

  1. the C implementation _pickle.Unpickler, or
  2. the Python implementation pickle._Unpickler.

They are defined in the following places, respectively.

  1. static PyTypeObject Unpickler_Type; defined in Modules/_pickle.c
  2. class _Unpickler defined in Lib/pickle.py

The object is restored by calling unpickler.load_xxx() one after another, selected by the ID (called an opcode) of each element in the pickle data.
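The opcode sequence of a pickle can be listed with the standard pickletools module. The sketch below pickles a small OrderedDict with protocol 3 (the value is arbitrary) and prints the opcode names; names such as GLOBAL and REDUCE correspond to the load_global and load_reduce methods listed earlier.

```python
import pickle
import pickletools
from collections import OrderedDict

payload = pickle.dumps(OrderedDict(spam=1), protocol=3)

# genops yields (opcode, argument, position) for each entry in the stream.
names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(names)
```

pickletools.dis(payload) prints a more detailed, annotated listing of the same stream.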

Unpickle global opcode data

When a class or function was pickled, or when __reduce_ex__ returned a string, the module name and variable name are recorded as they are. In this case the module is imported if necessary and the corresponding value is returned. The unpickler does not create a new object.

Unpickle newobj, reduce, build opcode data

When an object was pickled using the tuple of up to 5 elements returned by __reduce_ex__ etc., it is unpickled by these steps. Rewriting the outline of the corresponding load_newobj, load_reduce and load_build methods as one simple flow gives the following.

sample09.py


def unpickle_something():
    func, args, state, listitems, dictitems = load_from_pickle_stream()

    if getattr(func, '__name__', None) == '__newobj__':
        obj = args[0].__new__(*args)
    else:
        obj = func(*args)

    if listitems is not None:
        for x in listitems:
            obj.append(x)

    if dictitems is not None:
        for k, v in dictitems:
            obj[k] = v

    if state is None:
        pass  # no state was pickled, so there is nothing to restore
    elif hasattr(obj, '__setstate__'):
        obj.__setstate__(state)
    elif type(state) is tuple and len(state) == 2:
        # (dict for __dict__, dict for __slots__); either may be None
        for k, v in (state[0] or {}).items():
            obj.__dict__[k] = v
        for k, v in (state[1] or {}).items():
            setattr(obj, k, v)
    else:
        for k, v in state.items():
            obj.__dict__[k] = v

    return obj
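The same flow can be exercised without a pickle stream by feeding it a reduce tuple directly. This is only a simplified sketch: rebuild and Tagged are hypothetical names, and slot state is left out for brevity.

```python
def rebuild(rv):
    """Rebuild an object by hand from a reduce tuple (simplified sketch)."""
    # Pad the optional trailing elements with None.
    func, args, state, listitems, dictitems = (tuple(rv) + (None,) * 3)[:5]

    if getattr(func, '__name__', None) == '__newobj__':
        cls = args[0]
        obj = cls.__new__(cls, *args[1:])  # __init__ is not called
    else:
        obj = func(*args)

    if listitems is not None:
        for x in listitems:
            obj.append(x)
    if dictitems is not None:
        for k, v in dictitems:
            obj[k] = v

    if state is not None:
        if hasattr(obj, '__setstate__'):
            obj.__setstate__(state)
        else:
            obj.__dict__.update(state)  # slot state omitted for brevity
    return obj


class Tagged(list):
    """Hypothetical list subclass with one extra attribute."""

original = Tagged([1, 2, 3])
original.label = 'spam'

clone = rebuild(original.__reduce_ex__(3))
print(type(clone).__name__, list(clone), clone.label)
```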

Case Study

Case where you do not have to do anything

Classes that satisfy all of the following conditions are handled appropriately without writing any pickle or unpickle processing.

  1. Everything in __dict__ can be pickled, and there is no problem if it is restored as is.
  2. The values of all attributes named in __slots__ can be pickled, and there is no problem if they are restored as is.
  3. The object has no internal data, held by a C implementation, that is inaccessible from Python.
  4. __new__ has not been extended to interpret arguments.
  5. Even if __init__ is not called, the object is consistent as long as its attributes are restored correctly.
  6. For subclasses of list and dict, all the elements can be pickled, and there is no problem if they are restored as is.

Objects with attributes that you do not want to include in the pickle (caches, etc.) or attributes that cannot be pickled

sphere0.py


import pickle

class Sphere:
    def __init__(self, radius):
        self._radius = radius
    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume

def _main():
    sp1 = Sphere(3)
    print(sp1.volume)
    print(sp1.__reduce_ex__(3))
    sp2 = pickle.loads(pickle.dumps(sp1))
    print(sp2.volume)

if __name__ == '__main__':
    _main()

The Sphere object, which represents a sphere, caches the calculation result internally when its volume property is accessed. If it is pickled as is, the cached volume is saved along with it and the data size grows. I want to remove it.

sphere1.py


class Sphere:
    def __init__(self, radius):
        self._radius = radius
    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume
    def __getstate__(self):
        return {'_radius': self._radius}

You can keep the cache out of the pickle by defining a __getstate__ method that returns the value __dict__ should have after unpickling.

sphere2.py


class Sphere:
    __slots__ = ['_radius', '_volume']
    def __init__(self, radius):
        self._radius = radius
    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume
    def __getstate__(self):
        return None, {'_radius': self._radius}

If you define __slots__ to improve memory efficiency, __dict__ no longer exists, so the value returned by __getstate__ must change. In this case it is a two-element tuple whose second element is a dictionary that initializes the __slots__ attributes. The first element (the initial value of __dict__) can be None.

sphere3.py


class Sphere:
    __slots__ = ['_radius', '_volume']
    def __init__(self, radius):
        self._radius = radius
    @property
    def volume(self):
        if not hasattr(self, '_volume'):
            from math import pi
            self._volume = 4/3 * pi * self._radius ** 3
        return self._volume
    def __getstate__(self):
        return self._radius
    def __setstate__(self, state):
        self._radius = state

If the radius is the only value to pickle, __getstate__ can return the self._radius value itself instead of a dictionary. In that case, define the matching __setstate__ as well.

Objects that cannot be created without giving appropriate arguments to __new__

intliterals.py


import pickle

class IntLiterals(tuple):
    def __new__(cls, n):
        a = '0b{n:b} 0o{n:o} {n:d} 0x{n:X}'.format(n=n).split()
        return super().__new__(cls, a)
    def __getnewargs__(self):
        return int(self[0], 0),

def _main():
    a = IntLiterals(10)
    print(a) # ('0b1010', '0o12', '10', '0xA')
    print(a.__reduce_ex__(3))
    b = pickle.loads(pickle.dumps(a))
    print(b)

if __name__ == '__main__':
    _main()

Objects that cannot be created without calling __init__

closureholder.py


import pickle

class ClosureHolder:
    def __init__(self, value):
        def _get():
            return value
        self._get = _get
    def get(self):
        return self._get()
    def __reduce_ex__(self, proto):
        return type(self), (self.get(),)

def _main():
    a = ClosureHolder('spam')
    print(a.get())
    print(a.__reduce_ex__(3))
    b = pickle.loads(pickle.dumps(a))
    print(b.get())

if __name__ == '__main__':
    _main()

The value returned by get is held by a closure created in __init__, so the object cannot be created without calling __init__. In such a case object.__reduce_ex__ cannot be used, so implement __reduce_ex__ yourself.

Singleton object

singleton.py


class MySingleton(object):
    def __new__(cls, *args, **kwds):
        assert mysingleton is None, \
            'A singleton of MySingleton has already been created.'
        return super().__new__(cls)
    def __reduce_ex__(self, proto):
        return 'mysingleton'

mysingleton = None
mysingleton = MySingleton()

def _main():
    import pickle
    a = pickle.dumps(mysingleton)
    b = pickle.loads(a)
    print(b)

if __name__ == '__main__':
    _main()

Suppose the MySingleton class has only one instance in the mysingleton global variable. To unpickle this correctly, use a format in which __reduce_ex__ returns a string.
