This article introduces xarray, a Python library that supports multidimensional data analysis. For more information, please refer to the official documentation.

Scientific measurement data is often multidimensional. For example, when time series data is measured with sensors installed at multiple positions, the measurement data is two-dimensional: spatial channel × time. Furthermore, when a short-time Fourier transform is applied to the data, it becomes three-dimensional: spatial channel × time × frequency.

When dealing with such data, you will often use numpy's np.ndarray. However, since np.ndarray is a plain matrix (or tensor), any other information has to be kept separately. In the example above, this "other information" includes:

- Dimension order: the first dimension of the two-dimensional data corresponds to the spatial channel, and the second dimension corresponds to time.
- The coordinates of each dimension

Therefore, if you want to cut out a certain time range from the data, for example, you must cut out the corresponding part of the time axis data along with the data itself.

Of course, you can do this correctly with a plain np.ndarray, but in a complicated program such fiddly bookkeeping can be a source of mistakes.
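As a rough illustration of that bookkeeping, here is a minimal sketch using plain numpy. The array shape, the axis values, and the variable names are made up for this example.

```python
import numpy as np

# Hypothetical measurement: 3 spatial channels x 6 time samples.
data_np = np.arange(18.0).reshape(3, 6)
t_axis = np.arange(6) * 0.5   # time coordinate, kept in a separate array

# Cut out 0.5 <= t <= 1.5: the mask must be applied to the data AND the axis.
mask = (t_axis >= 0.5) & (t_axis <= 1.5)
data_cut = data_np[:, mask]   # you must remember that time is the second dimension
t_cut = t_axis[mask]          # easy to forget, which leaves the axis out of sync

print(data_cut.shape, t_cut)
```

Nothing ties `data_cut` and `t_cut` together except the programmer's discipline, which is the problem xarray addresses.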

xarray is a library that simplifies such operations. (Since it uses np.ndarray internally, the fast computing performance of np.ndarray is hardly sacrificed.) pandas is a well-known library for handling one-dimensional data, but it cannot (easily) handle multidimensional data. xarray fills that gap.

In addition to the above, xarray offers the following, among other things:

- The `__str__` method is overridden, so printing an object gives you an overview.
- Position-based indexing and slicing (for example, searching for the data closest to a certain time) is possible. The result is also an xarray object, which correctly holds the information about the axes.
- Simple statistical processing (moving average, etc.) is possible. The result also holds the axis information correctly.
- Mutual conversion with pandas is possible.
- It appears to support huge data that does not fit in memory.
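Two of the features listed above, the moving average and the conversion to and from pandas, can be sketched as follows. The signal values and the name `signal` are invented for illustration, and the exact behavior may differ slightly between xarray versions.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-D time series with a labeled time axis.
da = xr.DataArray(np.array([1.0, 2.0, 3.0, 4.0]),
                  dims=['t'], coords={'t': [0.0, 0.5, 1.0, 1.5]},
                  name='signal')

# Moving average over a window of 2 samples; the result keeps the t axis.
# The first element is NaN because its window is incomplete.
smoothed = da.rolling(t=2).mean()

# Round trip through pandas: DataArray -> Series -> DataArray.
series = da.to_series()
restored = xr.DataArray.from_series(series)

print(smoothed)
print(series)
```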

This article uses the following version of xarray.

```
import numpy as np
import xarray as xr
xr.__version__
```

```
'0.9.1'
```

It is common to abbreviate xarray as `xr`.

It mainly supports two data types, xr.DataArray and xr.Dataset.

xr.DataArray
xr.DataArray is the multidimensional data mentioned above.
Internally, it has an ordered dictionary `coords`, which holds pairs of axis labels and axis values, and an ordered dictionary `attrs`, which stores other information.

Since the `__getitem__` method is overridden, you can access elements as da[i, j], just like np.ndarray.
However, since the return value is also an xr.DataArray object, it inherits the axis information and so on.
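The following minimal sketch shows `coords`, `attrs`, and integer indexing together. The coordinate values and the `units` attribute are assumptions chosen only for illustration.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 3)),
                  dims=['x', 't'],
                  coords={'x': [10.0, 20.0], 't': [0.0, 0.1, 0.2]},
                  attrs={'units': 'V'})   # 'units' is an illustrative attribute

print(da.coords['t'].values)   # axis values stored under the label 't'
print(da.attrs['units'])       # free-form metadata

# Integer indexing works as with np.ndarray, but returns an xr.DataArray
# that still knows its remaining axis.
sub = da[0, 1:]
print(type(sub).__name__, sub.dims, sub['t'].values)
```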

xr.Dataset
xr.Dataset is an object that holds multiple xr.DataArrays. It can have multiple axes, and it keeps track of which axes each piece of data corresponds to.

You can access it like a dictionary object. For example, for an xr.Dataset that holds temperature T and density N, data['T'] returns the temperature T as an xr.DataArray.
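A small sketch of the temperature and density example might look like this. The axis name `r` and all numerical values are hypothetical.

```python
import numpy as np
import xarray as xr

r = np.linspace(0.0, 1.0, 4)   # hypothetical axis shared by both variables
ds = xr.Dataset({'T': (['r'], np.array([300.0, 310.0, 320.0, 330.0])),
                 'N': (['r'], np.array([1.0, 0.9, 0.8, 0.7]))},
                coords={'r': r})

T = ds['T']                    # dictionary-style access returns an xr.DataArray
print(type(T).__name__, T.dims)
print(list(ds.data_vars))      # names of the stored variables
```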

xr.DataArray plays a role similar to `Series` in `pandas`.
It holds the data values themselves together with the axis data.

```
data = xr.DataArray(np.random.randn(2, 3))
```

This creates a 2x3 xr.DataArray object with no axis information.

You can view a summary of the created object with the `print` function.

```
print(data)
```

```
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 0.32853 , -1.010702, 1.220686],
[ 0.877681, 1.180265, -0.963936]])
Dimensions without coordinates: dim_0, dim_1
```

If you do not specify the axes explicitly, as in this case, dim_0 and dim_1 are assigned automatically.

For example, consider the case where the first dimension of some data `data_np` corresponds to the spatial position x and the second dimension corresponds to the time t.

```
#Example data
data_np = np.random.randn(5,4)
x_axis = np.linspace(0.0, 1.0, 5)
t_axis = np.linspace(0.0, 2.0, 4)
```

```
data = xr.DataArray(data_np, dims=['x', 't'],
                    coords={'x': x_axis, 't': t_axis},
                    name='some_measurement')
```

Here,

- In `dims`, list (or give a tuple of) the labels corresponding to each dimension of data_np.
- In `coords`, give the axis labels and the corresponding axis data in dictionary form.

```
print(data)
```

```
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 1.089975, 0.343039, -0.521509, 0.02816 ],
[ 1.117389, 0.589563, -1.030908, -0.023295],
[ 0.403413, -0.157136, -0.175684, -0.743779],
[ 0.814334, 0.164875, -0.489531, -0.335251],
[ 0.009115, 0.294526, 0.693384, -1.046416]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```

In the displayed summary, the line

`<xarray.DataArray 'some_measurement' (x: 5, t: 4)>`

indicates that this DataArray is a 5x4 matrix named `some_measurement`, whose first dimension has the axis label 'x' and whose second dimension has the axis label 't'.

The lines following

`Coordinates:`

list the axis data.

The list of axes can be accessed through `dims`.
The order shown here indicates which dimension of the original data each axis corresponds to.

```
data.dims
```

```
('x', 't')
```

To access the values of an axis, pass its label name inside `[]`.

```
data['x']
```

```
<xarray.DataArray 'x' (x: 5)>
array([ 0. , 0.25, 0.5 , 0.75, 1. ])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```
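If you need the axis values or the data as a plain np.ndarray, for example to pass them to other libraries, the `.values` attribute can be used. The following is a minimal sketch that rebuilds an array similar to the one above.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.random.randn(5, 4), dims=['x', 't'],
                    coords={'x': np.linspace(0.0, 1.0, 5),
                            't': np.linspace(0.0, 2.0, 4)})

x_values = data['x'].values    # plain np.ndarray of the x coordinate
raw = data.values              # the underlying data as np.ndarray
print(type(x_values).__name__, x_values)
print(raw.shape)
```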

xarray supports multiple types of indexing. Since it uses the indexing mechanism of pandas, it is as fast as pandas.

```
data[0,1]
```

```
<xarray.DataArray 'some_measurement' ()>
array(0.3430393695918721)
Coordinates:
t float64 0.6667
x float64 0.0
```

Since a DataArray is array-like, it can be indexed like an ordinary matrix. The result inherits the axis information.

By using the `.loc` indexer, you can access data by specifying positions in terms of the axis values.

```
data.loc[0:0.5, :1.0]
```

```
<xarray.DataArray 'some_measurement' (x: 3, t: 2)>
array([[ 1.089975, 0.343039],
[ 1.117389, 0.589563],
[ 0.403413, -0.157136]])
Coordinates:
* t (t) float64 0.0 0.6667
* x (x) float64 0.0 0.25 0.5
```

`.loc[0:0.5, :1.0]` cuts out the data in the range 0 ≤ x ≤ 0.5 along the axis of the first dimension and in the range t ≤ 1.0 along the axis of the second dimension. As the output above shows, the endpoint x = 0.5 is included, unlike with integer slicing.
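To make the inclusive endpoints easier to see, here is a small sketch with round axis values (chosen for illustration, not the values used above). It also compares the result with the equivalent `.sel` call.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(20.0).reshape(5, 4), dims=['x', 't'],
                    coords={'x': [0.0, 0.25, 0.5, 0.75, 1.0],
                            't': [0.0, 0.5, 1.0, 1.5]})

by_loc = data.loc[0:0.5, :1.0]                          # dimensions given by position
by_sel = data.sel(x=slice(0, 0.5), t=slice(None, 1.0))  # same selection by name

print(by_loc['x'].values)         # both endpoints, 0.0 and 0.5, are included
print(by_loc.identical(by_sel))
```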

Use the `.isel` and `.sel` methods to access data by axis label name.

`.isel` takes an axis label and an integer index along that axis.

```
data.isel(t=1)
```

```
<xarray.DataArray 'some_measurement' (x: 5)>
array([ 0.343039, 0.589563, -0.157136, 0.164875, 0.294526])
Coordinates:
t float64 0.6667
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```

`.sel` takes an axis label and a value on that axis.

```
data.sel(t=slice(0.5,2.0))
```

```
<xarray.DataArray 'some_measurement' (x: 5, t: 3)>
array([[ 0.343039, -0.521509, 0.02816 ],
[ 0.589563, -1.030908, -0.023295],
[-0.157136, -0.175684, -0.743779],
[ 0.164875, -0.489531, -0.335251],
[ 0.294526, 0.693384, -1.046416]])
Coordinates:
* t (t) float64 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```
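The relationship between the two methods can be sketched as follows: selecting by integer index with `.isel` and selecting by the corresponding axis value with `.sel` should give the same result. The axis values here are round numbers chosen for illustration.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(20.0).reshape(5, 4), dims=['x', 't'],
                    coords={'x': [0.0, 0.25, 0.5, 0.75, 1.0],
                            't': [0.0, 0.5, 1.0, 1.5]})

by_index = data.isel(t=1)     # second sample along t
by_value = data.sel(t=0.5)    # the sample whose t coordinate equals 0.5

print(by_index.identical(by_value))
print(by_index.dims, float(by_index['t']))
```

Because the axes are addressed by name, the code does not depend on whether t is the first or the second dimension.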

Many np.ndarray-like operations are supported.

Basic arithmetic, including broadcasting, is supported.

```
data+10
```

```
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 11.089975, 10.343039, 9.478491, 10.02816 ],
[ 11.117389, 10.589563, 8.969092, 9.976705],
[ 10.403413, 9.842864, 9.824316, 9.256221],
[ 10.814334, 10.164875, 9.510469, 9.664749],
[ 10.009115, 10.294526, 10.693384, 8.953584]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```

Element-wise calculations also carry over the axis information.

```
np.sin(data)
```

```
<xarray.DataArray 'some_measurement' (x: 5, t: 4)>
array([[ 0.886616, 0.336351, -0.498189, 0.028156],
[ 0.89896 , 0.555998, -0.857766, -0.023293],
[ 0.39256 , -0.15649 , -0.174781, -0.677074],
[ 0.727269, 0.164129, -0.470212, -0.329006],
[ 0.009114, 0.290286, 0.639144, -0.865635]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```
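Broadcasting between two DataArrays is done by dimension name rather than by position. The following minimal sketch, with made-up values, adds an array defined only along x to one defined only along t.

```python
import numpy as np
import xarray as xr

x_part = xr.DataArray([1.0, 2.0, 3.0], dims=['x'], coords={'x': [0.0, 0.5, 1.0]})
t_part = xr.DataArray([10.0, 20.0], dims=['t'], coords={'t': [0.0, 1.0]})

# Dimensions are matched by name, so no manual reshaping is needed.
total = x_part + t_part
print(total.dims, total.shape)

# Reductions also take a dimension name instead of an axis number.
print(total.mean(dim='t').values)
```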

`xr.Dataset` is an object that collects multiple `xr.DataArray`s.

In particular, you can index and slice `xr.DataArray`s that share an axis all at once.
A single measuring instrument often outputs several types of signals, and `xr.Dataset` is well suited to handling such **multidimensional** information.

It plays a role similar to `DataFrame` in `pandas`.

The first argument, `data_vars`, is `dict`-like.
Use the name of the data to be stored as the key, and a two-element tuple as the value.
The first element of the tuple is the axis labels corresponding to that data, and the second element is the data itself (`array`-like).

Pass a `dict`-like object to `coords` to store the axis data.
Use the axis label as the key and the axis values as the value.

```
ds = xr.Dataset({'data1': (['x','t'], np.random.randn(5,4)), 'data2': (['x','t'], np.random.randn(5,4))},
coords={'x': x_axis, 't': t_axis})
```

```
ds
```

```
<xarray.Dataset>
Dimensions: (t: 4, x: 5)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
Data variables:
data1 (x, t) float64 -1.091 -1.851 0.343 2.077 1.477 0.0009389 1.358 ...
data2 (x, t) float64 0.4852 -0.5463 -0.22 -1.357 -1.416 -0.4929 ...
```
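Since an `xr.Dataset` is a collection of `xr.DataArray`s, it can also be built from existing DataArray objects instead of tuples. This is a minimal sketch with placeholder data; the variable names are arbitrary.

```python
import numpy as np
import xarray as xr

coords = {'x': [0.0, 0.5, 1.0], 't': [0.0, 1.0]}
da1 = xr.DataArray(np.ones((3, 2)), dims=['x', 't'], coords=coords)
da2 = xr.DataArray(np.zeros((3, 2)), dims=['x', 't'], coords=coords)

# The dictionary keys become the variable names in the Dataset.
ds_from_arrays = xr.Dataset({'data1': da1, 'data2': da2})
print(ds_from_arrays)
```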

To access the contents, pass the label name inside `[]`.
In that case, the return value is an `xr.DataArray`.

```
ds['data1']
```

```
<xarray.DataArray 'data1' (x: 5, t: 4)>
array([[ -1.091230e+00, -1.851416e+00, 3.429677e-01, 2.077113e+00],
[ 1.476765e+00, 9.389425e-04, 1.358136e+00, -1.627471e+00],
[ -2.007550e-01, 1.008126e-01, 7.177067e-01, 8.893402e-01],
[ -1.813395e-01, -3.407015e-01, -9.673550e-01, 1.135727e+00],
[ 2.423873e-01, -1.198268e+00, 1.650465e+00, -1.923102e-01]])
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```

You can also access the axes by label.

```
ds['x']
```

```
<xarray.DataArray 'x' (x: 5)>
array([ 0. , 0.25, 0.5 , 0.75, 1. ])
Coordinates:
* x (x) float64 0.0 0.25 0.5 0.75 1.0
```

Use `.isel` for index access. To access the element at index 1 (the second element) along the x axis, specify the axis label name and the corresponding index, as follows:

```
ds.isel(x=1)
```

```
<xarray.Dataset>
Dimensions: (t: 4)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
x float64 0.25
Data variables:
data1 (t) float64 1.477 0.0009389 1.358 -1.627
data2 (t) float64 -1.416 -0.4929 0.4926 -0.7186
```

Of course, you can specify multiple axes.

```
ds.isel(x=1, t=2)
```

```
<xarray.Dataset>
Dimensions: ()
Coordinates:
t float64 1.333
x float64 0.25
Data variables:
data1 float64 1.358
data2 float64 0.4926
```

It also supports slicing.

```
ds.isel(x=slice(0,2,1), t=2)
```

```
<xarray.Dataset>
Dimensions: (x: 2)
Coordinates:
t float64 1.333
* x (x) float64 0.0 0.25
Data variables:
data1 (x) float64 0.343 1.358
data2 (x) float64 -0.22 0.4926
```

Use the `.sel` method for position indexing.
As with `.isel`, specify the axis label name, but this time give the axis value.

```
ds.sel(x=0.0)
```

```
<xarray.Dataset>
Dimensions: (t: 4)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
x float64 0.0
Data variables:
data1 (t) float64 -1.091 -1.851 0.343 2.077
data2 (t) float64 0.4852 -0.5463 -0.22 -1.357
```

By default, only an exactly matching value is returned, but you can change this with the `method` option.
If you want the nearest value, set `method='nearest'`.

```
# Returns the data whose x is closest to 0.4.
ds.sel(x=0.4, method='nearest')
```

```
<xarray.Dataset>
Dimensions: (t: 4)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
x float64 0.5
Data variables:
data1 (t) float64 -0.2008 0.1008 0.7177 0.8893
data2 (t) float64 -0.03163 0.6942 0.8194 -2.93
```
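The difference between the default exact matching and `method='nearest'` can be sketched as follows, using a small made-up Dataset. With the default, a value that is not on the axis raises a `KeyError`.

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset({'data1': (['x'], np.arange(5.0))},
                      coords={'x': [0.0, 0.25, 0.5, 0.75, 1.0]})

found = True
try:
    ds_small.sel(x=0.4)            # no exact match for 0.4 on the x axis
except KeyError:
    found = False
    print('x=0.4 not found')

# With method='nearest', the closest axis value (0.5) is used instead.
nearest = ds_small.sel(x=0.4, method='nearest')
print(float(nearest['x']), float(nearest['data1']))
```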

You can also pass a slice object.

```
ds.sel(x=slice(0,0.5))
```

```
<xarray.Dataset>
Dimensions: (t: 4, x: 3)
Coordinates:
* t (t) float64 0.0 0.6667 1.333 2.0
* x (x) float64 0.0 0.25 0.5
Data variables:
data1 (x, t) float64 -1.091 -1.851 0.343 2.077 1.477 0.0009389 1.358 ...
data2 (x, t) float64 0.4852 -0.5463 -0.22 -1.357 -1.416 -0.4929 ...
```
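To confirm that one selection is applied to every variable at once, the following sketch slices a small made-up Dataset and then converts the result to a pandas DataFrame with `to_dataframe`.

```python
import numpy as np
import xarray as xr

ds_demo = xr.Dataset({'data1': (['x', 't'], np.arange(20.0).reshape(5, 4)),
                      'data2': (['x', 't'], -np.arange(20.0).reshape(5, 4))},
                     coords={'x': [0.0, 0.25, 0.5, 0.75, 1.0],
                             't': [0.0, 0.5, 1.0, 1.5]})

sub_ds = ds_demo.sel(x=slice(0, 0.5))
for var_name in sub_ds.data_vars:
    print(var_name, sub_ds[var_name].shape)   # every variable is cut to x <= 0.5

df = sub_ds.to_dataframe()   # one row per (x, t) pair, one column per variable
print(df.shape)
```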
