[PYTHON] Explaining the mechanism of PEP 557 data classes

TL;DR

What is a data class?

dataclass is a standard library module added in Python 3.7. Simply put, adding the @dataclass decorator to a class generates so-called dunder (short for "double underscore") methods such as __init__, __repr__, __eq__, and (depending on the options) __hash__. It can significantly reduce tedious class definitions, and the generated code is faster than a poorly written implementation. dataclass has various features beyond those introduced here, so please refer to the official documentation and [Python 3.7: "Data Classes" may become the standard for class definitions](https://qiita.com/tag1216/items/13b032348c893667862a).

For those who cannot use Python 3.7 yet, PyPI has a backport for 3.6.

How to use dataclass

from dataclasses import dataclass, field
from typing import ClassVar, List, Dict, Tuple
import copy

@dataclass
class Foo:
    i: int
    s: str
    f: float
    t: Tuple[int, str, float, bool]
    d: Dict[str, int]
    b: bool = False  # Default value
    l: List[str] = field(default_factory=list)  # Use default_factory to default to an empty list []
    c: ClassVar[int] = 10  # Class variable

# Instantiate with the generated `__init__`
f = Foo(i=10, s='hoge', f=100.0, b=True,
        l=['a', 'b', 'c'], d={'a': 10, 'b': 20},
        t=(10, 'hoge', 100.0, False))

# Print the string representation produced by the generated `__repr__`
print(f)

# Make a deep copy and modify it
ff = copy.deepcopy(f)
ff.l.append('d')

# Compare using the generated `__eq__`
assert f != ff

Performance

I measured the execution time of __init__, __repr__, and __eq__ for DataclassFoo, created with dataclass, and ManualFoo, written by hand.

Source code used for measurement
import timeit
from dataclasses import dataclass

@dataclass
class DataclassFoo:
    i: int
    s: str
    f: float
    b: bool

class ManualFoo:
    def __init__(self, i, s, f, b):
        self.i = i
        self.s = s
        self.f = f
        self.b = b
    def __repr__(self):
        return f'ManualFoo(i={self.i}, s={self.s}, f={self.f}, b={self.b})'
    def __eq__(self, b):
        a = self
        return a.i == b.i and a.s == b.s and a.f == b.f and a.b == b.b

def bench(name, f):
    # Run 5 sets of 100,000 calls and print the average time per set
    times = timeit.repeat(f, number=100000, repeat=5)
    print(name + ':\t' + f'{sum(times)/5:.5f}')

bench('dataclass __init__', lambda: DataclassFoo(10, 'foo', 100.0, True))
bench('manual class __init__', lambda: ManualFoo(10, 'foo', 100.0, True))

df = DataclassFoo(10, 'foo', 100.0, True)
mf = ManualFoo(10, 'foo', 100.0, True)
bench('dataclass __repr__', lambda: str(df))
bench('manual class __repr__', lambda: str(mf))

df2 = DataclassFoo(10, 'foo', 100.0, True)
mf2 = ManualFoo(10, 'foo', 100.0, True)
bench('dataclass __eq__', lambda: df == df2)
bench('manual class __eq__', lambda: mf == mf2)

Average of running 5 sets of 100,000 times each

Measurement result (sec)

Method                       Time
dataclass __init__           0.04382
Handwritten class __init__   0.04003
dataclass __repr__           0.07527
Handwritten class __repr__   0.08414
dataclass __eq__             0.04755
Handwritten class __eq__     0.04593

Even over 500,000 executions in total, there is almost no difference.

The bytecode also matches.

dataclass __init__
>>> import dis
>>> dis.dis(DataclassFoo.__init__)
  2           0 LOAD_FAST                1 (i)
              2 LOAD_FAST                0 (self)
              4 STORE_ATTR               0 (i)

  3           6 LOAD_FAST                2 (s)
              8 LOAD_FAST                0 (self)
             10 STORE_ATTR               1 (s)

  4          12 LOAD_FAST                3 (f)
             14 LOAD_FAST                0 (self)
             16 STORE_ATTR               2 (f)

  5          18 LOAD_FAST                4 (b)
             20 LOAD_FAST                0 (self)
             22 STORE_ATTR               3 (b)
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE
Handwritten class __init__
>>> dis.dis(ManualFoo.__init__)
 13           0 LOAD_FAST                1 (i)
              2 LOAD_FAST                0 (self)
              4 STORE_ATTR               0 (i)

 14           6 LOAD_FAST                2 (s)
              8 LOAD_FAST                0 (self)
             10 STORE_ATTR               1 (s)

 15          12 LOAD_FAST                3 (f)
             14 LOAD_FAST                0 (self)
             16 STORE_ATTR               2 (f)

 16          18 LOAD_FAST                4 (b)
             20 LOAD_FAST                0 (self)
             22 STORE_ATTR               3 (b)
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE
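
The same comparison can also be done programmatically instead of by eye. The following is only a minimal sketch and was not part of the measurement above: the helper `instruction_list` and the classes `DataclassPoint` and `ManualPoint` are names I made up for illustration, and whether the two lists match exactly may depend on the Python version.

```python
import dis
from dataclasses import dataclass


@dataclass
class DataclassPoint:
    x: int
    y: int


class ManualPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def instruction_list(func):
    """Return (opname, argval) pairs, ignoring offsets and line numbers."""
    return [(ins.opname, ins.argval) for ins in dis.get_instructions(func)]


generated = instruction_list(DataclassPoint.__init__)
handwritten = instruction_list(ManualPoint.__init__)

# True when both __init__ functions compile to the same instruction sequence
print(generated == handwritten)
```

Comparing only opcode names and arguments leaves out the line numbers, which differ between the two functions in the `dis` output above.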

Before going into the internals of dataclass

First, I would like to explain two things that are important for understanding how dataclass works.

PEP526: Syntax for Variable Annotations

PEP 526 describes the syntax for declaring types. With this addition to the specification, the type information of variables declared in a class can now be obtained while the program is running.

from typing import Dict
class Player:
    players: Dict[str, 'Player']  # Quoted, because Player is not defined yet inside its own body
    __points: int

print(Player.__annotations__)
# {'players': typing.Dict[str, ForwardRef('Player')],
#  '_Player__points': <class 'int'>}
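
To make it a little more concrete what can be read at runtime, here is a small sketch. The class `Point` and its attributes are hypothetical names used only for this illustration. It walks over `__annotations__` and looks up whether each annotated name also has a class-level default, which is roughly the kind of information a library like dataclass needs.

```python
class Point:
    x: int
    y: float = 0.0        # annotated, with a default value
    label = "origin"      # not annotated, so it is absent from __annotations__


for name, tp in Point.__annotations__.items():
    # getattr with a sentinel shows whether a class-level default exists
    default = getattr(Point, name, "<no default>")
    print(name, tp, default)
```

Note that `label` never shows up, because only annotated names are recorded.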

Built-in `exec` function

I think many people know eval. Roughly speaking, the difference between the two is:

  • `eval`: evaluates the argument string as an expression
  • `exec`: executes the argument string as statements

That alone may not make the difference clear, so let's look at some examples.

It's easy to imagine that running this prints "typing rocks!".

>>> exec('print("typing rocks!")')
typing rocks!

Then what about this?

exec('''
def func():
    print("typing rocks!")
''')

Then try calling it:

>>> func()
typing rocks!

That's right. exec executes strings as statements, so even Python functions can be defined dynamically. Great.
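
To put the two side by side, here is a small sketch. The dictionary `namespace` and the names `total` and `double` are my own choices for the example. Passing an explicit dictionary makes it easy to see where the results of exec end up.

```python
namespace = {}

# eval returns the value of a single expression
result = eval("1 + 2", namespace)

# exec runs statements and returns None; its effects land in the namespace
returned = exec("total = 1 + 2\ndef double(n):\n    return n * 2", namespace)

# An assignment is a statement, not an expression, so eval rejects it
try:
    eval("total = 1 + 2", namespace)
    eval_accepts_statement = True
except SyntaxError:
    eval_accepts_statement = False

print(result, returned, namespace["total"], namespace["double"](21))
print(eval_accepts_statement)
```

The function `double` did not exist anywhere in the source file; it was created from a string and can be fetched from the dictionary like any other object.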

So what is dataclass doing internally?

When a class with the dataclass decorator is imported, code is generated using the type annotations and exec described above. Very roughly, the flow is as follows. For details, read this part of the CPython source.

  1. The dataclass decorator is called on the class
  2. The information for each field (name, type, default value, etc.) is collected from the type annotations
  3. The definition of the __init__ function is built as a **string** from that field information
  4. The string is passed to `exec` to generate the function dynamically
  5. The generated __init__ function is set on the class

A simplified version of steps 3, 4, and 5 looks like this.

from dataclasses import fields

nl = '\n'  # Backslashes cannot be used inside f-string expressions before Python 3.12, so define the newline outside

# Build the function definition string
s = f"""
def func(self, {', '.join([f.name for f in fields(Foo)])}):
{nl.join('  self.'+f.name+'='+f.name for f in fields(Foo))}
"""

# Print the function definition string to the console
print(s)
# def func(self, i, s, f, t, d, b, l):
#   self.i=i
#   self.s=s
#   self.f=f
#   self.t=t
#   self.d=d
#   self.b=b
#   self.l=l

# Generate the code with exec; `func` is now defined in the current scope
exec(s)

setattr(Foo, 'func', func)  # Attach the generated function to the class
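
Putting all five steps together, a toy decorator could look like the sketch below. To be clear, `mini_dataclass` is a hypothetical name and this is not how CPython actually implements it. It only generates `__init__`, and it ignores defaults, ClassVar, and inheritance.

```python
def mini_dataclass(cls):
    """Hypothetical, stripped-down stand-in for @dataclass (only __init__)."""
    # Step 2: collect the field names from the type annotations
    names = list(getattr(cls, "__annotations__", {}))

    # Step 3: build the source of __init__ as a string
    params = ", ".join(["self"] + names)
    body = "\n".join(f"    self.{n} = {n}" for n in names) or "    pass"
    source = f"def __init__({params}):\n{body}\n"

    # Step 4: exec the string in a scratch namespace to create the function
    namespace = {}
    exec(source, namespace)

    # Step 5: attach the generated function to the class
    cls.__init__ = namespace["__init__"]
    return cls


# Step 1: the decorator is called on the class
@mini_dataclass
class Point:
    x: int
    y: int


p = Point(1, y=2)
print(p.x, p.y)
```

Using a scratch dictionary for exec keeps the generated function out of the module's global scope, which is a design choice of this sketch.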

The above is a simplified example. In reality, the following also have to be handled:

  • Default values set on fields
  • Default factory functions, used for List and the like
  • ClassVar
  • Skipping generation of methods the programmer has already defined
  • Generation of the other dunder methods
  • Inheritance between dataclass classes

The function definition strings are built carefully, and the code is generated so that it works correctly in all of these cases.
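
A few of these cases can be observed from the outside with the real dataclass decorator. In the sketch below, the class `Config` and its fields are hypothetical names for illustration. It shows only the visible behavior, not the generated source.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar, List


@dataclass
class Config:
    name: str
    tags: List[str] = field(default_factory=list)  # fresh list per instance
    retries: int = 3                               # plain default value
    version: ClassVar[int] = 1                     # class variable, not a field

    def __repr__(self):
        # Defined by hand, so dataclass leaves it alone
        return f"<Config {self.name}>"


a = Config("a")
b = Config("b")
a.tags.append("x")

print([f.name for f in fields(Config)])  # version is not listed
print(a.tags, b.tags)                    # the two lists are independent
print(repr(a))                           # the handwritten __repr__ is used
```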

Another thing to keep in mind is that this **code generation happens only once, at the moment the module is loaded**. Once the class has been imported, it can be used **just like a handwritten class**.
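
A quick way to convince yourself of this is the following sketch, where `Item` is a hypothetical class used only for illustration. It checks that creating instances does not replace the generated function, and that the generated method is an ordinary Python function.

```python
import types
from dataclasses import dataclass


@dataclass  # the methods are generated here, when the class statement runs
class Item:
    name: str
    price: float


init_before = Item.__init__
items = [Item("pen", 1.5) for _ in range(3)]

# Instantiating does not regenerate anything: same function object as before
print(Item.__init__ is init_before)
# The generated method is a plain Python function with its own code object
print(isinstance(Item.__init__, types.FunctionType))
```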

Rust's #[derive]

Rust has a derive attribute (#[derive]) that you add when defining a struct. It can do about the same as dataclass, or more. For example, look at the following:

#[derive(Debug, Clone, Eq, PartialEq, Hash)]
struct Foo {
    i: i32,
    s: String,
    b: bool,
}

Just adding #[derive(Debug, Clone, Eq, PartialEq, Hash)] generates all of these methods:

  • A method that produces a debug string (__repr__ in Python)
  • A method that clones the object
  • Comparison methods (__eq__ and __ne__ in Python)
  • A hashing method (__hash__ in Python)
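
For comparison, a rough Python counterpart of the struct above might look like this. This is my own approximation, not an exact equivalent: I use `frozen=True` so that `__hash__` is generated, and `dataclasses.replace` as a loose stand-in for Clone.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)  # eq=True (the default) plus frozen=True also generates __hash__
class Foo:
    i: int
    s: str
    b: bool


a = Foo(1, "x", True)
b = replace(a)              # new instance with the same field values

print(repr(a))              # Debug     -> __repr__
print(a == b, a is b)       # PartialEq -> __eq__ (equal, but distinct objects)
print(hash(a) == hash(b))   # Hash      -> __hash__
print(len({a, b}))          # equal, hashable values collapse into one set entry
```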

Rust goes even further: implementing your own custom derive is officially supported, which makes type-based metaprogramming relatively casual.

Rust has many other features like this that make programmers' lives easier, and I think that is why Rust is so productive despite its strict type constraints and ownership rules. Rust is a really great language, so I encourage Pythonistas to try it out.

The potential of dataclass as metaprogramming

I personally think that the dataclass is a good example of the usefulness and potential of type-based metaprogramming.

I have also written two libraries based on dataclass, so if you are interested, please take a look.

A library that maps environment variable values to dataclass fields. Useful when you want to override a Python config class with environment variables in a container.

A dataclass-based serialization library. It is under development, with the goal of providing the same functionality as serde, Rust's superb serialization library, using dataclass.

In conclusion

I hope that, as in Rust, the Python community gets excited about this area and produces a lot of good libraries.
