[PYTHON] Explain the mechanism of PEP557 data class

TL;DR

dataclass is very good :thumbsup:
It performs on par with a handwritten class
Libraries built on dataclass are likely to appear in the future

What is `data class`?

dataclass is a new standard library module added in Python 3.7. Simply put, when you add the `@dataclass` decorator to a class, it generates so-called dunder (short for "double underscore") methods such as `__init__`, `__repr__`, and `__eq__` (and, depending on the decorator options, `__hash__`). It can greatly reduce tedious class definitions, and the generated code is faster than a poorly written hand implementation. dataclass has many features beyond those introduced here, so please refer to the official documentation and [Python 3.7 "Data Classes" may become the standard for class definitions](https://qiita.com/tag1216/items/13b032348c893667862a).

For those who still can't use Python 3.7, PyPI has a backport for Python 3.6.

How to use `data class`

from dataclasses import dataclass, field
from typing import ClassVar, List, Dict, Tuple
import copy

@dataclass
class Foo:
    i: int
    s: str
    f: float
    t: Tuple[int, str, float, bool]
    d: Dict[str, int]
    b: bool = False  # Default value
    l: List[str] = field(default_factory=list)  # Mutable defaults such as [] need default_factory
    c: ClassVar[int] = 10  # Class variable (not an instance field)

# Instantiate with the generated `__init__`
f = Foo(i=10, s='hoge', f=100.0, b=True,
        l=['a', 'b', 'c'], d={'a': 10, 'b': 20},
        t=(10, 'hoge', 100.0, False))

# Print the string representation produced by the generated `__repr__`
print(f)

# Make a deep copy and modify it
ff = copy.deepcopy(f)
ff.l.append('d')

# Compare with the generated `__eq__`
assert f != ff
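
As a quick check of what the decorator generated, the sketch below inspects two small dataclasses. The `Point` and `FrozenPoint` classes and their fields are made up for illustration; `fields` and the `frozen` option are part of the standard `dataclasses` module.

```python
from dataclasses import dataclass, fields


@dataclass
class Point:  # hypothetical example class
    x: int
    y: int = 0


@dataclass(frozen=True)
class FrozenPoint:  # frozen=True makes instances immutable and hashable
    x: int
    y: int = 0


p = Point(1)
print(p)                                # uses the generated __repr__
print(p == Point(1, 0))                 # uses the generated __eq__
print([f.name for f in fields(Point)])  # the fields dataclass collected

# With the default options (eq=True, frozen=False) __hash__ is set to None,
# so instances are unhashable. frozen=True generates a field-based __hash__.
print(Point.__hash__ is None)
print(hash(FrozenPoint(1)) == hash(FrozenPoint(1, 0)))
```

Note that `__hash__` is the one dunder method in the list above that depends on the decorator options, which is why the plain `Point` ends up unhashable.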

Performance

I measured the execution time of `__init__`, `__repr__`, and `__eq__` for DataclassFoo, created with dataclass, and ManualFoo, written by hand.

macOS 10.14 Mojave
2.3 GHz 8-core Intel Core i9
DDR4 32GB RAM
Python 3.6.3

Source code used for measurement

import timeit
from dataclasses import dataclass

@dataclass
class DataclassFoo:
    i: int
    s: str
    f: float
    b: bool

class ManualFoo:
    def __init__(self, i, s, f, b):
        self.i = i
        self.s = s
        self.f = f
        self.b = b

    def __repr__(self):
        return f'ManualFoo(i={self.i}, s={self.s}, f={self.f}, b={self.b})'

    def __eq__(self, b):
        a = self
        return a.i == b.i and a.s == b.s and a.f == b.f and a.b == b.b

def bench(name, f):
    # Run 5 sets of 100,000 calls each and print the average time per set
    times = timeit.repeat(f, number=100000, repeat=5)
    print(name + ':\t' + f'{sum(times)/5:.5f}')

bench('dataclass __init__', lambda: DataclassFoo(10, 'foo', 100.0, True))
bench('manual class __init__', lambda: ManualFoo(10, 'foo', 100.0, True))

df = DataclassFoo(10, 'foo', 100.0, True)
mf = ManualFoo(10, 'foo', 100.0, True)
bench('dataclass __repr__', lambda: str(df))
bench('manual class __repr__', lambda: str(mf))

df2 = DataclassFoo(10, 'foo', 100.0, True)
mf2 = ManualFoo(10, 'foo', 100.0, True)
bench('dataclass __eq__', lambda: df == df2)
bench('manual class __eq__', lambda: mf == mf2)

Average of running 5 sets of 100,000 times each

| | Measurement result (sec) |
|---|---|
| dataclass `__init__` | 0.04382 |
| Handwritten class `__init__` | 0.04003 |
| dataclass `__repr__` | 0.07527 |
| Handwritten class `__repr__` | 0.08414 |
| dataclass `__eq__` | 0.04755 |
| Handwritten class `__eq__` | 0.04593 |

Even over 500,000 executions in total, there is almost no difference. The bytecode also matched.

dataclass `__init__`

>>> import dis
>>> dis.dis(DataclassFoo.__init__)
  2           0 LOAD_FAST                1 (i)
              2 LOAD_FAST                0 (self)
              4 STORE_ATTR               0 (i)

  3           6 LOAD_FAST                2 (s)
              8 LOAD_FAST                0 (self)
             10 STORE_ATTR               1 (s)

  4          12 LOAD_FAST                3 (f)
             14 LOAD_FAST                0 (self)
             16 STORE_ATTR               2 (f)

  5          18 LOAD_FAST                4 (b)
             20 LOAD_FAST                0 (self)
             22 STORE_ATTR               3 (b)
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE

Handwritten class `__init__`

>>> dis.dis(ManualFoo.__init__)
 13           0 LOAD_FAST                1 (i)
              2 LOAD_FAST                0 (self)
              4 STORE_ATTR               0 (i)

 14           6 LOAD_FAST                2 (s)
              8 LOAD_FAST                0 (self)
             10 STORE_ATTR               1 (s)

 15          12 LOAD_FAST                3 (f)
             14 LOAD_FAST                0 (self)
             16 STORE_ATTR               2 (f)

 16          18 LOAD_FAST                4 (b)
             20 LOAD_FAST                0 (self)
             22 STORE_ATTR               3 (b)
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE

Before going into the internal explanation of the data class

I would like to cover a few things that matter when explaining how the data class works.

PEP526: Syntax for Variable Annotations

PEP526 describes how to declare types. With this addition to the language, the type information of variables declared in a class can now be obtained while the program is running.

from typing import Dict

class Player:
    players: Dict[str, 'Player']  # Quoted forward reference: Player is not defined yet at this point
    __points: int

print(Player.__annotations__)
# On Python 3.7 this prints:
# {'players': typing.Dict[str, ForwardRef('Player')],
#  '_Player__points': <class 'int'>}

Built-in `exec` function

I think many people know eval. Roughly speaking, the difference from eval is:

eval: Evaluates the argument string as an expression
exec: Executes the argument string as a statement

That description alone doesn't say much, so let's look at some examples.
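
As a first, minimal illustration of the expression/statement difference (the variable names and the `namespace` dict are only for illustration):

```python
# eval takes an expression and returns its value
value = eval('1 + 2')
print(value)

# exec takes statements; it runs them and always returns None
namespace = {}  # hypothetical dict that collects the names exec defines
result = exec('x = 1 + 2', namespace)
print(result, namespace['x'])

# An assignment is a statement, not an expression, so eval rejects it
try:
    eval('x = 1 + 2')
    eval_accepted_statement = True
except SyntaxError:
    eval_accepted_statement = False
print(eval_accepted_statement)
```

Passing an explicit dict to `exec` is optional, but it keeps the generated names out of the current scope and makes it obvious where they end up.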
It's easy to imagine that running the following call prints "typing rocks!".

>>> exec('print("typing rocks!")')
typing rocks!

Then what about this?

exec('''
def func():
    print("typing rocks!")
''')

Then try this:

>>> func()
typing rocks!

That's right. Because exec executes strings as statements, even Python functions can be defined dynamically. Great.

So what is dataclass doing internally?

When a module containing a class with the dataclass decorator is imported, code is generated using the type annotations and exec described above. Very roughly, the flow is as follows. For more information, read this part of the CPython source.

1. The dataclass decorator is called on the class
2. Get the type information (type name, type class, default value, etc.) of each field from the type annotations
3. Create an `__init__` function definition **string** using the type information
4. Pass the string to `exec` to dynamically generate the function
5. Set the `__init__` function on the class

Code that simplifies steps 3, 4, and 5 looks like this. It uses the `Foo` class defined earlier.

from dataclasses import fields

nl = '\n'  # A backslash cannot be used inside an f-string expression, so define it outside

# Build the function definition string
s = f"""
def func(self, {', '.join([f.name for f in fields(Foo)])}):
{nl.join('    self.'+f.name+'='+f.name for f in fields(Foo))}
"""

# Print the function definition string to the console
print(s)
# def func(self, i, s, f, t, d, b, l):
#     self.i=i
#     self.s=s
#     self.f=f
#     self.t=t
#     self.d=d
#     self.b=b
#     self.l=l

# Generate code with exec. `func` is defined in the current scope
exec(s)
setattr(Foo, 'func', func)  # Attach the generated function to the class

The above is a simplified example. In reality, the following also have to be handled:

- Default values set on fields
- Default factory functions, used for List etc.
- ClassVar
- Not generating a method if the programmer has already defined it
- Generation of the other dunder methods
- Inheritance between dataclasses

The function definition string is built and the code is generated carefully so that it works correctly in all of these cases.
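
The five steps above can be sketched as a toy decorator. This is only an illustration under simplifying assumptions, not CPython's actual implementation: `mini_dataclass`, the `_dflt_` prefix, and the `Point` class are hypothetical names, only `__init__` is generated, and fields without defaults must come before fields with defaults.

```python
def mini_dataclass(cls):
    """Toy decorator: build __init__ from the class annotations using exec."""
    # Step 2: read field names from the annotations
    annotations = getattr(cls, '__annotations__', {})

    params = []    # parameter list of the generated __init__
    body = []      # assignment lines of the generated __init__
    defaults = {}  # default values, passed to exec through its namespace

    for name in annotations:
        if hasattr(cls, name):
            # A class attribute with the same name is treated as the default
            defaults[f'_dflt_{name}'] = getattr(cls, name)
            params.append(f'{name}=_dflt_{name}')
        else:
            params.append(name)
        body.append(f'    self.{name} = {name}')
    if not body:
        body.append('    pass')

    # Step 3: create the function definition string
    source = f"def __init__(self, {', '.join(params)}):\n" + '\n'.join(body)

    # Step 4: let exec define the function inside a private namespace
    namespace = dict(defaults)
    exec(source, namespace)

    # Step 5: attach the generated function to the class
    cls.__init__ = namespace['__init__']
    return cls


@mini_dataclass
class Point:
    x: int
    y: int = 0


p = Point(1)
print(p.x, p.y)
```

Default values are handed to `exec` through the namespace dict rather than being written into the source string, so they do not need a valid source representation.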
Another thing to keep in mind is that this **code generation only occurs the moment the module is loaded**. Once the class is imported, it can be used **just like a handwritten class**.

Rust's `#[derive]`

Rust has a derive attribute (`#[derive]`) that you add when defining a struct. It can do about the same as the data class, or more. For example, look at the following.

#[derive(Debug, Clone, Eq, PartialEq, Hash)]
struct Foo {
    i: i32,
    s: String,
    b: bool,
}

Just by adding `#[derive(Debug, Clone, Eq, PartialEq, Hash)]`, all of these methods are generated:

- Methods for debug string generation (`__repr__` in Python)
- Methods to clone an object
- Equality comparison methods (`__eq__` in Python; ordering comparisons such as `__gt__` would additionally require deriving `PartialOrd`)
- Hasher methods (`__hash__` in Python)

Rust goes even further: implementing your own custom derive is officially supported, which makes type-based metaprogramming relatively casual. Rust has many other features that make programmers' lives easier, and I think that is why Rust is so productive, even with its strict type constraints and ownership rules. Rust is a really great language, so I encourage Pythonistas to try it out.

Possibility of dataclass as metaprogramming

I personally think that dataclass is a good example of the usefulness and potential of type-based metaprogramming.

I have also made two libraries based on dataclass, so if you are interested, please take a look.

- A library that maps environment variable values to dataclass fields. Useful when you want to override a Python config class with environment variables in a container
- A dataclass-based serialization library. Under development to implement the same functionality as Rust's superb serde library using dataclass

In conclusion

As with Rust, I hope Python will get excited about this area and produce a lot of good libraries.

