Python Pickle format notes

background

I want to exchange binary data between Python and C/C++ for machine learning and ray tracing, and I want to do it using only Python's standard library. For text there are JSON and NumPy's text format (CSV), but for binary data there is no standard format that is easy to handle on the C++ side.

So let's consider pickle serialization.

https://docs.python.org/ja/3/library/pickle.html

It seems that endianness is also taken into account.

information

I could not find a site that concisely explains the pickle serialization format itself, not even in English. (Once you understand it, the format is not that complicated, so perhaps nobody felt it needed explaining...)

Thankfully, the PyTorch JIT includes its own C++ pickle loader, which it uses for serialization in TorchScript (a Python-like scripting language), and that code is a helpful reference.

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/docs/serialization.md

https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/serialization/pickler.h

You can also analyze pickled data with Python's pickletools module.

https://docs.python.org/ja/3.6/library/pickletools.html

format

Protocol version

Pickle has several protocol versions. Protocol 3 was the default from Python 3.0 through 3.7 (Python 3.8 changed the default to 4), and data pickled with protocol 3 or later cannot be read by Python 2.

If you mainly handle numerical data and nothing too exotic, protocol 2 is probably a reasonable choice. (TorchScript only supports protocol 2.)

For protocol 2 and later, the header is 2 bytes: 0x80 (the PROTO opcode, 1 byte) followed by the version number (1 byte).
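As a quick check of the header described above, here is a minimal sketch using only the standard pickle module. The helper name has_proto_header is my own invention for illustration and is not part of any library.

```python
import pickle


def has_proto_header(data: bytes) -> bool:
    """Return True if data starts with the PROTO opcode (0x80).

    Hypothetical helper for illustration; not part of the pickle module.
    """
    return len(data) >= 2 and data[0] == 0x80


# Protocol 2 and later start with PROTO followed by the version byte.
p2 = pickle.dumps(1, protocol=2)
print(has_proto_header(p2), p2[1])

# Protocols 0 and 1 predate the PROTO opcode, so they have no header.
p0 = pickle.dumps(1, protocol=0)
print(has_proto_header(p0))

# The default and the highest supported protocol depend on the Python version.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

A C++ loader can use the same first-byte check to reject streams it does not support.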

Let's try serializing 1.

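import io
import pickle

a = 1

f = io.BytesIO()
# dump() returns None; protocol 3 matches the od output below.
pickle.dump(a, f, protocol=3)

# Write the pickled bytes to a file; the with block closes it.
with open("bora.p", "wb") as w:
    w.write(f.getvalue())

$ od -tx1c bora.p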
0000000  80  03  4b  01  2e
        200 003   K 001   .
0000005

'K' (0x4b) is BININT1, and '.' (0x2e) is STOP, which marks the end of the data.

Looking at unpickler.cpp in the PyTorch JIT,

    case PickleOpCode::BININT1: {
      uint8_t value = read<uint8_t>();
      stack_.emplace_back(int64_t(value));
    } break;

You can see that BININT1 encodes an integer that fits in a single unsigned byte (0 to 255).
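To make the opcode + data layout concrete, here is a minimal Python sketch that decodes the five bytes from the od output above. It only understands PROTO, BININT1 and STOP. The function name load_small_int is hypothetical, and this is not how the real unpickler is structured, just the same idea in a few lines.

```python
import pickle

PROTO = 0x80    # followed by a 1-byte protocol version
BININT1 = 0x4B  # 'K', followed by a 1-byte unsigned integer
STOP = 0x2E     # '.', end of the pickle


def load_small_int(data: bytes) -> int:
    """Decode a pickle that contains a single BININT1 value.

    Hypothetical helper for illustration; it rejects every other opcode.
    """
    pos = 0
    stack = []
    while pos < len(data):
        op = data[pos]
        pos += 1
        if op == PROTO:
            pos += 1  # skip the version byte
        elif op == BININT1:
            stack.append(data[pos])  # one unsigned byte, 0..255
            pos += 1
        elif op == STOP:
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
    raise ValueError("pickle ended without STOP")


# The exact bytes shown by od: 80 03 4b 01 2e
print(load_small_int(bytes.fromhex("80034b012e")))
# 255 is the largest value that still fits in BININT1.
print(load_small_int(pickle.dumps(255, protocol=2)))
```

Values above 255 use other opcodes (for example BININT2), which this sketch deliberately rejects.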

Next, let's try array data (a Python list).

import io
import pickle

a = [1, 2]

f = io.BytesIO()
pickle.dump(a, f, protocol=2)  # dump() returns None

# Write the pickled bytes to a file; the with block closes it.
with open("bora.p", "wb") as w:
    w.write(f.getvalue())

Now let's dump it with pickletools.

$ python -m pickletools bora.p 
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: e        APPENDS    (MARK at 5)
   11: .    STOP
highest protocol among opcodes = 2

Basically, the format is a sequence of prefix (opcode) + actual data. From here, you can try various inputs and analyze the results, referring to pickler.cpp and unpickler.cpp in the PyTorch JIT and to Python's pickletools.py.
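The [1, 2] dump above can be replayed with a tiny stack machine. The following is a rough sketch that handles only the seven opcodes appearing in that dump. The names load_int_list and MARK_OBJ are my own, and a real loader needs many more opcodes and error checks.

```python
import pickle

MARK_OBJ = object()  # sentinel pushed by MARK (hypothetical name)


def load_int_list(data: bytes):
    """Replay the opcodes of a protocol 2 pickle of a list of small ints."""
    stack, memo, pos = [], {}, 0
    while True:
        op = data[pos:pos + 1]
        pos += 1
        if op == b"\x80":    # PROTO: skip the version byte
            pos += 1
        elif op == b"]":     # EMPTY_LIST: push a new list
            stack.append([])
        elif op == b"q":     # BINPUT: remember the top of the stack
            memo[data[pos]] = stack[-1]
            pos += 1
        elif op == b"(":     # MARK: push the marker
            stack.append(MARK_OBJ)
        elif op == b"K":     # BININT1: push a 1-byte unsigned int
            stack.append(data[pos])
            pos += 1
        elif op == b"e":     # APPENDS: move items above the mark into the list
            mark = len(stack) - 1 - stack[::-1].index(MARK_OBJ)
            items = stack[mark + 1:]
            del stack[mark:]
            stack[-1].extend(items)
        elif op == b".":     # STOP: the result is on top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode {op!r}")


# The 12 bytes from the pickletools dump above.
blob = bytes.fromhex("80025d7100284b014b02652e")
print(load_int_list(blob))
print(pickle.loads(blob))  # the standard loader, for comparison
```

The memo filled by BINPUT is never read here because this dump contains no BINGET, but the NumPy dump later on does use it.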

numpy array

Let's serialize a NumPy array (ndarray).

import io
import pickle

import numpy

a = numpy.array([1.0, 2.2, 3.3, 4, 5, 6, 7, 8, 9, 10], dtype=numpy.float32)

f = io.BytesIO()
pickle.dump(a, f, protocol=2)  # dump() returns None

with open("bora.p", "wb") as w:
    w.write(f.getvalue())

Dumping it with pickletools gives:

    0: \x80 PROTO      2
    2: c    GLOBAL     'numpy.core.multiarray _reconstruct'
   38: q    BINPUT     0
   40: c    GLOBAL     'numpy ndarray'
   55: q    BINPUT     1
   57: K    BININT1    0
   59: \x85 TUPLE1
   60: q    BINPUT     2
   62: c    GLOBAL     '_codecs encode'
   78: q    BINPUT     3
   80: X    BINUNICODE 'b'
   86: q    BINPUT     4
   88: X    BINUNICODE 'latin1'
   99: q    BINPUT     5
  101: \x86 TUPLE2
  102: q    BINPUT     6
  104: R    REDUCE
  105: q    BINPUT     7
  107: \x87 TUPLE3
  108: q    BINPUT     8
  110: R    REDUCE
  111: q    BINPUT     9
  113: (    MARK
  114: K        BININT1    1
  116: K        BININT1    10
  118: \x85     TUPLE1
  119: q        BINPUT     10
  121: c        GLOBAL     'numpy dtype'
  134: q        BINPUT     11
  136: X        BINUNICODE 'f4'
  143: q        BINPUT     12
  145: K        BININT1    0
  147: K        BININT1    1
  149: \x87     TUPLE3
  150: q        BINPUT     13
  152: R        REDUCE
  153: q        BINPUT     14
  155: (        MARK
  156: K            BININT1    3
  158: X            BINUNICODE '<'
  164: q            BINPUT     15
  166: N            NONE
  167: N            NONE
  168: N            NONE
  169: J            BININT     -1
  174: J            BININT     -1
  179: K            BININT1    0
  181: t            TUPLE      (MARK at 155)
  182: q        BINPUT     16
  184: b        BUILD
  185: \x89     NEWFALSE
  186: h        BINGET     3
  188: X        BINUNICODE '\x00\x00\x80?ÍÌ\x0c@33S@\x00\x00\x80@\x00\x00\xa0@\x00\x00À@\x00\x00à@\x00\x00\x00A\x00\x00\x10A\x00\x00 A'
  240: q        BINPUT     17
  242: h        BINGET     5
  244: \x86     TUPLE2
  245: q        BINPUT     18
  247: R        REDUCE
  248: q        BINPUT     19
  250: t        TUPLE      (MARK at 113)
  251: q    BINPUT     20
  253: b    BUILD
  254: .    STOP
highest protocol among opcodes = 2

You can see that the array data is stored as a byte string in the BINUNICODE entry at offset 188, which is then converted back to bytes by _codecs.encode with 'latin1'. With some study of the NumPy source code, it seems feasible to load pickled NumPy arrays and PyTorch tensors (which presumably have a structure similar to NumPy's) with your own C++ loader! (NumPy's native NPY/NPZ format is somewhat simpler. For example, cnpy can read and write it: https://github.com/rogersce/cnpy)
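Going by the dump above, the raw buffer is not stored byte for byte: it is written as a BINUNICODE string (which pickle encodes as UTF-8) and turned back into bytes with the 'latin1' codec. Here is a small sketch of that round trip using only the standard library, so it runs without NumPy. The variable names are mine, and it assumes the payload is exactly the little-endian float32 data ('<' and 'f4' in the dump).

```python
import struct

values = [1.0, 2.2, 3.3, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Raw little-endian float32 buffer, assumed to match the array's memory.
raw = struct.pack("<10f", *values)

# What ends up in the stream: the bytes viewed as latin1 text, written as UTF-8.
# Every byte >= 0x80 grows to two bytes in this step.
stored = raw.decode("latin1").encode("utf-8")
print(len(raw), len(stored))

# What a loader has to do: decode UTF-8, then map each code point to one byte.
recovered = stored.decode("utf-8").encode("latin1")
floats = struct.unpack("<10f", recovered)
print(floats)
```

The stored length agrees with the dump: BINUNICODE starts at offset 188 and the next opcode is at 240, which leaves 1 opcode byte, 4 length bytes and 47 payload bytes. A C++ loader therefore needs a small UTF-8 decoder before it can reinterpret the payload as floats.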

TODO
