Use dataclass for data storage in Python 3.7 or later

Introduction

Are you using a dictionary or ordinary class to store data in Python? Starting with Python 3.7, there is a dataclass decorator that is useful for storing data.

In this article, I explain how to use it, focusing on when it is convenient and why you should use it, points that are hard to grasp from the official documentation and PEP 557 alone.

Among earlier versions, only Python 3.6 can use it, by running pip install dataclasses. At the time of writing, the Google Colaboratory environment runs Python 3.6.9, but dataclasses is installed by default.

Intended readers

- People who know that dataclass exists but don't know what it is
- People who want to handle data in a highly readable way
- People who think, "We got by without this feature before, so there's no particular need to use it..."

The minimum explanation you often see

↓ Take this ordinary class:

class Person:
    def __init__(self, number, name='XXX'):
        self.number = number
        self.name = name

person1 = Person(0, 'Alice')
print(person1.number) # 0
print(person1.name) # Alice

↓ With dataclass, you can write it like this. (The class name is changed to tell the two apart.)

import dataclasses
@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'
        
dataclass_person1 = DataclassPerson(0, 'Alice')
print(dataclass_person1.number) # 0
print(dataclass_person1.name) # Alice

You use it by adding the @dataclasses.dataclass decorator and, instead of writing __init__(), listing the variables you want to define along with their type annotations.

__init__() is created automatically, and type annotations are required

What has changed is that you no longer have to bother assigning arguments to instance variables in __init__(), because __init__() is created automatically. This saves effort when there are many variables, and I like how clean the result looks. Other special methods such as __eq__() and __repr__() are also created automatically, as described below.

And since type annotations are mandatory, I'm happy that the types are always visible. (That said, even in a normal class you would want to write def __init__(self, number: int, name: str = 'XXX').)

It also states clearly that this class exists to store data, which is an important factor for readability.
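To make the two points above concrete, here is a minimal sketch that inspects what the decorator generates. The unannotated attribute note and the instance name unchecked are made up for illustration. The sketch also shows something that is easy to assume otherwise: the annotations are required for a name to become a field, but they are not checked at runtime.

```python
import dataclasses
import inspect


@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'
    note = 'no annotation'  # hypothetical: no type annotation, so this is NOT a field


# The generated __init__() takes only the annotated fields.
print(inspect.signature(DataclassPerson.__init__))

# Only annotated names are treated as fields.
field_names = [f.name for f in dataclasses.fields(DataclassPerson)]
print(field_names)  # ['number', 'name']

# Annotations are not enforced at runtime: this does not raise.
unchecked = DataclassPerson('zero')
print(unchecked)  # DataclassPerson(number='zero', name='XXX')
```

If you want the types to be verified, a static checker such as mypy is the usual companion; the decorator itself does no validation.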

I want to avoid dictionaries

If the example above is all you want to do, a dictionary would work. Why bother with a class, let alone the dataclass decorator? Many people seem to reach for a dictionary for input and output by default.

dict_person1 = {'number': 0, 'name': 'Alice'}
print(dict_person1['number']) # 0
print(dict_person1['name']) # Alice

What are the readily apparent disadvantages of dictionaries?

  1. Dot access is not possible. (Though you may be fine without it.)
  2. Methods, such as ones for saving the data, cannot be included.
  3. Type annotations are not possible.
  4. It is difficult to tell from the code that the data has a fixed shape.

Points 3 and 4 matter when you aim for code that is easy to read and maintain later, and they are a reason to avoid dictionaries even if you don't need methods. However, a regular class can cover these points as well.
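Point 4 can be illustrated with a small sketch, assuming a misspelled key (nmae is a deliberate typo invented for this example). A dictionary accepts any key, whereas a class with a fixed set of fields rejects the mistake at construction time. An ordinary class with the same __init__() would behave the same way; a dataclass is used here only for brevity.

```python
import dataclasses


@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'


# A dictionary accepts any key, so a misspelled key goes unnoticed.
dict_person = {'number': 0, 'nmae': 'Alice'}  # typo in 'name'
print('name' in dict_person)  # False

# The constructor rejects a keyword that is not a defined field.
typo_error = None
try:
    DataclassPerson(number=0, nmae='Alice')
except TypeError as error:
    typo_error = error
    print(type(error).__name__)  # TypeError
```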

Benefits of dataclass

Let's take a deep dive into how a class with the dataclass decorator is better than a regular class.

Advantage: __eq__() is created automatically, which makes unit testing easy

When you compare instances of a normal class, two instances that have the same contents but are separate objects compare as False. This is because the default comparison is by identity (the value returned by id()), which is not very useful. With unit testing in mind, I want the result to be True when the elements match.

↓ If you do nothing in a normal class, it will be like this.

class Person:
    def __init__(self, number, name='XXX'):
        self.number = number
        self.name = name

person1 = Person(0, 'Alice')

print(person1 == Person(0, 'Alice')) # False
print(person1 == Person(1, 'Bob')) # False

↓ To compare by elements in a normal class, you have to define __eq__() yourself.

class Person:
    def __init__(self, number, name='XXX'):
        self.number = number
        self.name = name
        
    def __eq__(self, other):
        if not isinstance(other, Person):
            return NotImplemented
        return self.number == other.number and self.name == other.name

person1 = Person(0, 'Alice')

print(person1 == Person(0, 'Alice')) # True
print(person1 == Person(1, 'Bob')) # False

↓ If you use the dataclass decorator, this __eq__() is created automatically. It saves time and looks neat.

@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'
        
dataclass_person1 = DataclassPerson(0, 'Alice')

print(dataclass_person1 == DataclassPerson(0, 'Alice')) # True
print(dataclass_person1 == DataclassPerson(1, 'Bob')) # False
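As a sketch of why this helps with unit tests, the example below compares a returned instance against an expected one with assertEqual. The function load_person is hypothetical and only stands in for whatever code is being tested.

```python
import dataclasses
import unittest


@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'


def load_person():
    """Hypothetical function under test; stands in for real code."""
    return DataclassPerson(0, 'Alice')


class LoadPersonTest(unittest.TestCase):
    def test_load_person(self):
        # assertEqual relies on the generated __eq__().
        self.assertEqual(load_person(), DataclassPerson(0, 'Alice'))


# Run the test case programmatically (this also works in a notebook).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(LoadPersonTest)
result = unittest.TextTestRunner().run(suite)
print(result.wasSuccessful())  # True
```

With a normal class that lacks __eq__(), the same assertion would fail even though the contents match.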

Also, if you specify @dataclasses.dataclass(order=True), __lt__(), __le__(), __gt__(), and __ge__() are created for ordering comparisons. They compare the fields in definition order, just like comparing tuples, so the result is decided by the first field that differs. That can be a little confusing, so you may prefer to define these methods yourself if you need them.
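A minimal sketch of the tuple-like ordering follows; the class name OrderedPerson and the sample values are made up for illustration. Because number is defined first, name only matters when the numbers are equal.

```python
import dataclasses


@dataclasses.dataclass(order=True)
class OrderedPerson:
    number: int
    name: str = 'XXX'


# number is compared first; name only breaks ties.
print(OrderedPerson(0, 'Bob') < OrderedPerson(1, 'Alice'))  # True
print(OrderedPerson(0, 'Bob') < OrderedPerson(0, 'Alice'))  # False

people = [OrderedPerson(1, 'Alice'), OrderedPerson(0, 'Bob'), OrderedPerson(0, 'Alice')]
ordered = sorted(people)
print([(p.number, p.name) for p in ordered])  # [(0, 'Alice'), (0, 'Bob'), (1, 'Alice')]
```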

Advantage: You can use asdict to convert it into a dictionary even if it is nested.

Use dataclasses.asdict() when you want to convert an instance to a dictionary, for example to output it as JSON. Nested dataclasses are converted as well.

@dataclasses.dataclass
class DataclassScore:
    writing: int
    reading: int
    listening: int
    speaking: int
        
@dataclasses.dataclass
class DataclassPerson:
    score: DataclassScore
    number: int
    name: str = 'Alice'
        
dataclass_person1 = DataclassPerson(DataclassScore(25, 40, 30, 35), 0, 'Alice')
dict_person1 = dataclasses.asdict(dataclass_person1)
print(dict_person1) # {'score': {'writing': 25, 'reading': 40, 'listening': 30, 'speaking': 35}, 'number': 0, 'name': 'Alice'}

import json
print(json.dumps(dict_person1)) # {"score": {"writing": 25, "reading": 40, "listening": 30, "speaking": 35}, "number": 0, "name": "Alice"}

Even a normal class can be converted to a dictionary format by using __dict__, but it takes some effort when nested.

To go back from the dictionary to the class, unpack it as follows. Note that a nested value such as score stays a dictionary; it is not converted back into its dataclass.

DataclassPerson(**dict_person1)
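When the dataclass is nested, plain unpacking restores only the outer level. The sketch below shows one way to rebuild the inner object by hand; the variable names original, shallow, and restored are made up for this example, and the manual conversion of score is just one possible approach.

```python
import dataclasses


@dataclasses.dataclass
class DataclassScore:
    writing: int
    reading: int
    listening: int
    speaking: int


@dataclasses.dataclass
class DataclassPerson:
    score: DataclassScore
    number: int
    name: str = 'Alice'


original = DataclassPerson(DataclassScore(25, 40, 30, 35), 0, 'Alice')
dict_person1 = dataclasses.asdict(original)

# Plain unpacking leaves the nested value as a dict.
shallow = DataclassPerson(**dict_person1)
print(type(shallow.score).__name__)  # dict

# Rebuild the nested dataclass explicitly before unpacking the rest.
fields = dict(dict_person1)
fields['score'] = DataclassScore(**fields['score'])
restored = DataclassPerson(**fields)
print(restored == original)  # True
```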

Advantage: Easy to make immutable

A dataclass can easily be made immutable. Making data immutable when it should never be rewritten removes the worry that it might have been changed somewhere.

↓ It is mutable if nothing is specified.

@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'
        
dataclass_person1 = DataclassPerson(0, 'Alice')
print(dataclass_person1.number) # 0
print(dataclass_person1.name) # Alice

dataclass_person1.number = 1
print(dataclass_person1.number) # 1

↓ If you set frozen=True in the decorator argument, it becomes immutable. In this case __hash__() is also created automatically, so you can use hash() to get a hash value.

@dataclasses.dataclass(frozen=True)
class FrozenDataclassPerson:
    number: int
    name: str = 'Alice'
    
frozen_dataclass_person1 = FrozenDataclassPerson(number=0, name='Alice')
print(frozen_dataclass_person1.number) # 0
print(frozen_dataclass_person1.name) # Alice
print(hash(frozen_dataclass_person1)) # e.g. -4135290249524779415 (the value varies between runs)

frozen_dataclass_person1.number = 1 # FrozenInstanceError: cannot assign to field 'number'
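Two practical consequences of frozen=True can be sketched as follows. The dictionary scores and its value are made up for illustration. Because instances are hashable, they can serve as dictionary keys or set members, and dataclasses.replace() gives you a changed copy without mutating the original.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class FrozenDataclassPerson:
    number: int
    name: str = 'Alice'


alice = FrozenDataclassPerson(0, 'Alice')

# Hashable instances can be dictionary keys or set members.
scores = {alice: 130}
print(scores[FrozenDataclassPerson(0, 'Alice')])  # 130

# dataclasses.replace() builds a new instance instead of mutating.
renumbered = dataclasses.replace(alice, number=1)
print(renumbered)  # FrozenDataclassPerson(number=1, name='Alice')
print(alice.number)  # 0

# Assignment raises FrozenInstanceError, a subclass of AttributeError.
try:
    alice.number = 1
except dataclasses.FrozenInstanceError as error:
    print(type(error).__name__)  # FrozenInstanceError
```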

How it differs from named tuples, which are also immutable

For data you want to make immutable, the standard library also provides collections.namedtuple and typing.NamedTuple.

With these, you can create tuples (that is, immutable objects) that allow dot access.

from collections import namedtuple

CollectionsNamedTuplePerson = namedtuple('CollectionsNamedTuplePerson', ('number', 'name'))

collections_namedtuple_person1 = CollectionsNamedTuplePerson(number=0, name='Alice')
print(collections_namedtuple_person1.number) # 0
print(collections_namedtuple_person1.name) # Alice
print(collections_namedtuple_person1 == (0, 'Alice')) # True

collections_namedtuple_person1.number = 1 # AttributeError: can't set attribute

↓ Furthermore, typing.NamedTuple also supports type annotations.

from typing import NamedTuple

class NamedTuplePerson(NamedTuple):
    number: int
    name: str = 'XXX'

namedtuple_person1 = NamedTuplePerson(0, 'Alice')
print(namedtuple_person1.number) # 0
print(namedtuple_person1.name) # Alice
print(namedtuple_person1 == (0, 'Alice')) # True

namedtuple_person1.number = 1 # AttributeError: can't set attribute

For more information, the Qiita article "Write beautiful python with namedtuple! (Translation)" is easy to understand.

dataclass and typing.NamedTuple are similar but differ in the details. As the code above shows, a NamedTuple compares as True against a plain tuple with the same elements, which seems like a disadvantage.

One convenient feature of typing.NamedTuple is that, being a tuple, it supports unpacking assignment. Depending on the use case, it may still be better to settle on a dataclass.
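The difference in unpacking and in comparison with plain tuples can be sketched side by side. For the dataclass, dataclasses.astuple() is used here as one way to get tuple-style unpacking when you need it.

```python
import dataclasses
from typing import NamedTuple


class NamedTuplePerson(NamedTuple):
    number: int
    name: str = 'XXX'


@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'


# A NamedTuple is a tuple, so it unpacks directly.
number, name = NamedTuplePerson(0, 'Alice')
print(number, name)  # 0 Alice

# A dataclass is not iterable; convert it with astuple() first.
dc_number, dc_name = dataclasses.astuple(DataclassPerson(0, 'Alice'))
print(dc_number, dc_name)  # 0 Alice

# Unlike the NamedTuple, the dataclass does not equal a plain tuple.
print(NamedTuplePerson(0, 'Alice') == (0, 'Alice'))  # True
print(DataclassPerson(0, 'Alice') == (0, 'Alice'))  # False
```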

Other features

__repr__() is created, so you can easily check the contents

Since __repr__() is created automatically, you can easily check the contents with print() and the like.

@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'
        
dataclass_person1 = DataclassPerson(0, 'Alice')
print(dataclass_person1) # DataclassPerson(number=0, name='Alice')

If you want to have the same display in a normal class, you need to write the following.

class Person:
    def __init__(self, number, name='XXX'):
        self.number = number
        self.name = name

    def __repr__(self):
        # !r applies repr() to each value so strings keep their quotes, as in the dataclass output.
        return f'{self.__class__.__name__}({", ".join([f"{key}={value!r}" for key, value in self.__dict__.items()])})'

person1 = Person(0, 'Alice')
print(person1) # Person(number=0, name='Alice')

You can write post-initialization processing with __post_init__()

Use __post_init__() for anything other than plain assignment that you would have done in the __init__() of a normal class. This method is called after the fields have been assigned. Also, use dataclasses.field(init=False) to create an instance variable that is not passed as an argument.

@dataclasses.dataclass
class DataclassPerson:
    number: int
    name: str = 'XXX'
    is_even: bool = dataclasses.field(init=False)
    
    def __post_init__(self):
        self.is_even = self.number%2 == 0
        
dataclass_person1 = DataclassPerson(0, 'Alice')
print(dataclass_person1.number) # 0
print(dataclass_person1.name) # Alice
print(dataclass_person1.is_even) # True
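Besides computing derived values, __post_init__() is a natural place for validation or normalization. The following is a minimal sketch; the class name CheckedPerson and the rules it applies (a non-negative number and a stripped name) are assumptions invented for this example.

```python
import dataclasses


@dataclasses.dataclass
class CheckedPerson:
    number: int
    name: str = 'XXX'

    def __post_init__(self):
        # Runs after the generated __init__() has assigned the fields.
        if self.number < 0:
            raise ValueError(f'number must be non-negative: {self.number}')
        self.name = self.name.strip()


print(CheckedPerson(0, ' Alice '))  # CheckedPerson(number=0, name='Alice')

try:
    CheckedPerson(-1)
except ValueError as error:
    print(error)  # number must be non-negative: -1
```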

You can pass initialization-only arguments with InitVar

As in the example below, there may be values that you want to pass as arguments at initialization but don't want to keep as instance variables.

class Person:
    def __init__(self, number, name='XXX'):
        self.name = name
        self.is_even = number%2 == 0

person1 = Person(0, 'Alice')
print(person1.name) # Alice
print(person1.is_even) # True

In that case, use InitVar.

@dataclasses.dataclass
class DataclassPerson:
    number: dataclasses.InitVar[int]
    name: str = 'XXX'
    is_even: bool = dataclasses.field(init=False)
    
    def __post_init__(self, number):
        self.is_even = number%2 == 0
        
dataclass_person1 = DataclassPerson(0, 'Alice')
print(dataclass_person1.name) # Alice
print(dataclass_person1.is_even) # True
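To confirm that the InitVar value really is not kept, this sketch inspects the instance after construction. The variable name person is made up for the example.

```python
import dataclasses


@dataclasses.dataclass
class DataclassPerson:
    number: dataclasses.InitVar[int]
    name: str = 'XXX'
    is_even: bool = dataclasses.field(init=False)

    def __post_init__(self, number):
        self.is_even = number % 2 == 0


person = DataclassPerson(0, 'Alice')

# InitVar pseudo-fields are not reported by fields() or asdict().
print([f.name for f in dataclasses.fields(person)])  # ['name', 'is_even']
print(dataclasses.asdict(person))  # {'name': 'Alice', 'is_even': True}

# The value is not stored on the instance.
print(hasattr(person, 'number'))  # False
```

Note that this holds for an InitVar without a default value, as in the example above.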

In closing

This was written for an Advent calendar less than a year after I joined the company. These points tend to get glossed over in individual development, but they are the ones I want to value in team development.

However convenient a feature is, it is easy to put off catching up on it when you can get by without it, but new features are added for a reason. With the introduction of type annotations and more, the feel of recent Python has changed considerably from a few years ago. People may like or dislike these changes, but you cannot even consider using something you do not know about, so I want to make sure I am not left behind!

References

dataclasses: Data Classes (Python 3.9.1 Documentation)
PEP 557: Data Classes (Python.org)


