TL;DR
xarray is a Python package for data analysis, like NumPy and pandas, that handles multidimensional arrays with metadata (axis labels and so on) attached. As I work with various kinds of data in xarray (DataArray), I keep wanting the following.
- I want an array constructor in which the axes (dimensions) and type (dtype) of the multidimensional array are already specified.
- Similarly, I want to specify the axes, type, and default value of the metadata (coordinates).
- I want special constructors like NumPy's ones() that satisfy the two points above.
- I want to define data-specific processing (methods).
There are many ways to achieve these, but I noticed that data classes (dataclasses), added to the standard library in Python 3.7, solve the same kind of problem with a simple notation. So I have released a package, "xarray-custom", for creating user-defined DataArray classes in the same way as data classes (you can install it with pip).
from xarray_custom import ctype, dataarrayclass

@dataarrayclass(accessor='img')
class Image:
    """DataArray class to represent images."""

    dims = 'x', 'y'
    dtype = float
    x: ctype('x', int) = 0
    y: ctype('y', int) = 0

    def normalize(self):
        return self / self.max()
Below, I explain how this code works, touching on Python data classes and DataArray along the way.
Python's dataclass
First, Python data classes are a feature (class decorator) that makes it easy to create user-defined data structures.
from dataclasses import dataclass

@dataclass
class Coordinates:
    x: float
    y: float

    @classmethod
    def origin(cls):
        return cls(0.0, 0.0)

    def norm(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5
If you define a class like this, you get a class that stores data and can be instantiated as follows:
coord = Coordinates(x=3, y=4)
Without the decorator, you would have to implement various special methods such as __init__() yourself. I think defining a data class this way has the following advantages.
- Each stored value can have a name, a type (as a type hint only), and a default value.
- Special data generation (origin() in the above example) can be implemented as a class method.
- Data-specific processing (norm() in the above example) can be implemented as an instance method.
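As a quick check of these points, here is a short usage sketch of the Coordinates class from above (repeated so that the snippet runs on its own). The equality comparison at the end relies on the __eq__() that dataclass generates by default.

```python
from dataclasses import dataclass


@dataclass
class Coordinates:
    x: float
    y: float

    @classmethod
    def origin(cls):
        """Special data generation: the point (0, 0)."""
        return cls(0.0, 0.0)

    def norm(self):
        """Data-specific processing: distance from the origin."""
        return (self.x ** 2 + self.y ** 2) ** 0.5


coord = Coordinates(x=3, y=4)
print(coord)                 # Coordinates(x=3, y=4)
print(coord.norm())          # 5.0
print(Coordinates.origin())  # Coordinates(x=0.0, y=0.0)

# __init__(), __repr__() and __eq__() were all generated by the decorator.
print(Coordinates.origin() == Coordinates(0.0, 0.0))  # True
```

Note that the type hints are not enforced: x=3 is stored as the integer 3, not converted to 3.0.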
xarray's DataArray
Next, let's look at the data structure of xarray. A DataArray is a data structure consisting of a NumPy multidimensional array (data), axes (dimensions; a kind of metadata), and metadata (coordinates). The following example represents a monochrome image with two axes, x and y, as a DataArray.
from xarray import DataArray

image = DataArray(data=[[0, 1], [2, 3]], dims=('x', 'y'),
                  coords={'x': [0, 1], 'y': [0, 1]})
print(image)
# <xarray.DataArray (x: 2, y: 2)>
# array([[0, 1],
#        [2, 3]])
# Coordinates:
#   * x        (x) int64 0 1
#   * y        (y) int64 0 1
Since DataArray is a class, a common way to define a user-defined DataArray is to create a new class that subclasses DataArray. However, if you hard-code the definition in __init__() and the like, it cannot be reused. In addition, xarray and pandas do not actively recommend subclassing in the first place; they recommend implementing user-defined processing through a special object called an accessor.
One standard solution to this problem is to subclass Dataset and/or DataArray to add domain specific functionality. However, inheritance is not very robust. It’s easy to inadvertently use internal APIs when subclassing, which means that your code may break when xarray upgrades. Furthermore, many builtin methods will only return native xarray objects. (Extending xarray - xarray 0.15.0 documentation)
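For reference, the accessor mechanism that xarray itself provides can be sketched as follows. This uses xarray's register_dataarray_accessor decorator directly, without xarray-custom; the accessor name "demo" and the class name DemoAccessor are made up for this illustration.

```python
import xarray as xr


# "demo" is a hypothetical accessor name chosen for this sketch.
@xr.register_dataarray_accessor("demo")
class DemoAccessor:
    def __init__(self, dataarray):
        # xarray passes in the DataArray the accessor is attached to.
        self._obj = dataarray

    def normalize(self):
        """Scale the array so that its maximum becomes 1."""
        return self._obj / self._obj.max()


image = xr.DataArray([[0.0, 1.0], [2.0, 3.0]], dims=("x", "y"))
normalized = image.demo.normalize()

# The result is still a plain DataArray; no subclass is involved.
print(type(normalized).__name__)  # DataArray
```

The processing lives in a separate object, so it does not depend on xarray's internal APIs the way a subclass might.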
xarray's dataarrayclass
Now for the main subject. To meet the requests listed at the beginning while respecting the circumstances of xarray (DataArray), I realized that I needed something that does the following.
- Provides a way to generate arrays without hard-coding the (meta)data definition
- Provides user-defined processing via an accessor
- Makes both of these simple to write
So, in xarray-custom, I decided to follow the example of Python data classes and implement this by dynamically modifying a user-defined class with a class decorator (dataarrayclass). Using xarray-custom, the monochrome image example above can be defined as follows.
from xarray_custom import ctype, dataarrayclass

@dataarrayclass(accessor='img')
class Image:
    """DataArray class to represent images."""

    dims = 'x', 'y'
    dtype = float
    x: ctype('x', int) = 0
    y: ctype('y', int) = 0

    def normalize(self):
        return self / self.max()
If you already know data classes, you may have a rough idea of what this code does without any explanation. There are three points to note.
- The axes ('x', 'y') and type (float) of the data are specified with class variables.
- The name, axes, type, and default value of each piece of metadata, including the axes themselves, are specified with specially typed class variables.
- User-defined processing (methods) is automatically moved to the accessor (img).
Here, ctype is a function that generates a special type (class) defining the metadata. Now let's actually generate a user-defined array.
image = Image([[0, 1], [2, 3]], x=[0, 1], y=[0, 1])
print(image)
# <xarray.DataArray (x: 2, y: 2)>
# array([[0., 1.],
#        [2., 3.]])
# Coordinates:
#   * x        (x) int64 0 1
#   * y        (y) int64 0 1
Because the axes and metadata are predefined, this is much simpler to write than the DataArray example above. Unlike a data class, the type information is actually used to determine the type of the array: in the above example, the list of integers is converted to float in the DataArray. If (implicit) type conversion is not possible, ValueError is raised. You can also leave the type unspecified, in which case objects of any type are accepted.
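The conversion behavior described here matches what NumPy does when asked for a specific dtype. The sketch below uses plain NumPy to show both the successful case and the failing one. Whether xarray-custom performs its conversion in exactly this way is an assumption; the article does not show its internals.

```python
import numpy as np

# A list of integers converts implicitly to float.
converted = np.asarray([[0, 1], [2, 3]], dtype=float)
print(converted.dtype)  # float64

# Strings that are not numbers cannot be converted to float.
conversion_failed = False
try:
    np.asarray([["a", "b"]], dtype=float)
except ValueError as error:
    conversion_failed = True
    print(f"ValueError: {error}")
```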
Instance methods via an accessor
Following xarray's policy, instance methods defined in the class (normalize() in the above example) are called through the accessor. Calling them without the accessor raises AttributeError.
normalized = image.img.normalize()
print(normalized)
# <xarray.DataArray (x: 2, y: 2)>
# array([[0.        , 0.33333333],
#        [0.66666667, 1.        ]])
# Coordinates:
#   * x        (x) int64 0 1
#   * y        (y) int64 0 1
Special functions as class methods
As special array constructors, zeros(), ones(), empty(), and full(), which are familiar from NumPy, are automatically added as class methods. zeros_like() and similar functions are defined as top-level functions of xarray, so use those instead.
uniform = Image.ones((2, 3))
print(uniform)
# <xarray.DataArray (x: 2, y: 3)>
# array([[1., 1., 1.],
#        [1., 1., 1.]])
# Coordinates:
#   * x        (x) int64 0 0
#   * y        (y) int64 0 0 0
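The top-level functions mentioned above work on any DataArray, so they need no support from xarray-custom. A minimal sketch, using a plain DataArray in place of an Image instance:

```python
import xarray as xr

image = xr.DataArray(
    [[0.0, 1.0], [2.0, 3.0]],
    dims=("x", "y"),
    coords={"x": [0, 1], "y": [0, 1]},
)

# Same dims, coordinates and dtype as `image`, filled with a constant.
zeros = xr.zeros_like(image)
sevens = xr.full_like(image, 7.0)

print(zeros.dims)           # ('x', 'y')
print(float(sevens.min()))  # 7.0
```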
Misc
At this point, some readers may be thinking: isn't dataarrayclass subclassing DataArray after all? However, the object that is actually generated is a genuine DataArray instance. Conversely, in the above example it is not an Image instance.
print(type(image))  # <class 'xarray.core.dataarray.DataArray'>
This is because dataarrayclass internally generates __new__() instead of __init__(), and __new__() returns a DataArray. Note, therefore, that class variables and the like cannot be accessed from the object the way they could from an instance of an ordinary class.
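The effect of a __new__() that returns an object of another type can be reproduced with the standard library alone. In the sketch below, Plain stands in for DataArray and returns_plain is a hypothetical decorator written for this illustration; it is not xarray-custom's actual implementation.

```python
class Plain:
    """Stands in for DataArray in this sketch."""

    def __init__(self, data):
        self.data = data


def returns_plain(cls):
    """Hypothetical decorator: make cls(...) build a Plain object instead."""

    def _new(klass, data):
        # Returning an object that is not an instance of `klass`
        # means Python skips klass.__init__() entirely.
        return Plain(data)

    cls.__new__ = staticmethod(_new)
    return cls


@returns_plain
class Image:
    dims = ("x", "y")


image = Image([[0, 1], [2, 3]])
print(type(image).__name__)    # Plain
print(isinstance(image, Image))  # False
print(hasattr(image, "dims"))  # False: class variables are not carried over
print(Image.dims)              # ('x', 'y'): still available on the class
```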
My impression is that xarray has fewer articles written about it and is less well known than NumPy and pandas, but it is definitely useful for tasks that deal with multidimensional arrays, so I would like to see it used more and more. xarray-custom is still in the early stages of development and has few features, but I hope to contribute to the community as much as I can by developing extensions like this and writing about them.