# Data analysis using xarray

Building on the previous article, this post introduces data analysis with xarray from a more practical point of view.

I have been a member of the pydata/xarray development team since 2017. These days I am mostly a user, but I still send the occasional bug-fix PR.

I use xarray for almost all of the data analysis in my day-to-day work.

This article assumes some knowledge of numpy (you can create np.ndarray objects and perform operations on them).

In this article, I will revisit xarray's data model and then cover the basics of indexing. It is fair to say that indexing is the essence of xarray.

After that, I will introduce concat, which can be thought of as the inverse operation of indexing.

Building on these, I will then discuss what kind of data structures make analysis with xarray easiest when you design your own data.

# xarray data model

## Overview

In a nutshell, xarray provides **multidimensional arrays with labeled axes**. You can think of it as a combination of numpy's np.ndarray and pandas' pd.Series.

Once you get used to it, it is very convenient: you no longer have to keep track of axis positions by hand during data analysis.

On the other hand, it is often said that its usage is a little unusual and the learning curve is steep. This article aims to flatten that curve as much as possible.

There are two main classes in xarray: xarray.DataArray and xarray.Dataset. I will introduce them in turn.

This is almost the same as the content described in Multidimensional data analysis library xarray.

## xarray.DataArray

Let me introduce xarray.DataArray, a **multidimensional array with labeled axes**. I will explain how to construct one later; first, let's look at the data bundled with xarray.tutorial.

```
import numpy as np
import xarray as xr
```
```
data = xr.tutorial.open_dataset('air_temperature')['air']
data
```
```
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

Here, `data` is a 3-D xr.DataArray instance (`xr` is the conventional abbreviation for xarray).

Displaying `data` prints a summary like the one above. The `(time: 2920, lat: 25, lon: 53)` in the top row shows the dimension names of the stored array and their numbers of elements.

As you can see, xarray **refers to each dimension by name** (called a *label* in the official documentation).

The first axis is the `time` axis, the second is the `lat` axis, and the third is the `lon` axis. Because the axes are named, you do not have to worry about their order.

In addition, coordinate values are attached to each axis and listed in the `Coordinates` section. In this data, the `time`, `lat`, and `lon` axes all have coordinates.

You can also attach arbitrary metadata, shown in the `Attributes` section.

As described in detail in the indexing section, coordinate axes make it possible to select the desired range of data intuitively. For example, to get the data for August 30, 2014:

```
data.sel(time='2014-08-30')
```
```
<xarray.DataArray 'air' (time: 4, lat: 25, lon: 53)>
array([[[270.29   , 270.29   , ..., 258.9    , 259.5    ],
[273.     , 273.1    , ..., 262.6    , 264.19998],
...,
[299.5    , 298.79   , ..., 299.     , 298.5    ],
[299.29   , 299.29   , ..., 299.9    , 300.     ]],

[[269.9    , 270.1    , ..., 258.6    , 259.5    ],
[273.     , 273.29   , ..., 262.5    , 265.     ],
...,
[299.19998, 298.9    , ..., 298.5    , 298.     ],
[299.4    , 299.69998, ..., 299.19998, 299.6    ]],

[[270.4    , 270.69998, ..., 261.29   , 261.9    ],
[273.     , 273.6    , ..., 266.4    , 268.6    ],
...,
[297.9    , 297.6    , ..., 298.29   , 298.29   ],
[298.69998, 298.69998, ..., 299.1    , 299.4    ]],

[[270.     , 270.4    , ..., 266.     , 266.19998],
[273.     , 273.9    , ..., 268.1    , 269.69998],
...,
[298.5    , 298.29   , ..., 298.69998, 299.     ],
[299.1    , 299.19998, ..., 299.5    , 299.69998]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2014-08-30 ... 2014-08-30T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

Here, the `.sel` method selects data using the coordinate axes: `.sel(time='2014-08-30')` looks up the `time` axis and selects the entries matching `'2014-08-30'`.

### Operations on DataArray

A DataArray is a multidimensional array like np.ndarray, so you can perform the same kinds of operations on it.

```
data * 2
```
```
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[482.4    , 485.     , 487.     , ..., 465.59998, 471.     ,
477.19998],
[487.59998, 489.     , 489.4    , ..., 465.59998, 470.59998,
478.59998],
[500.     , 499.59998, 497.78   , ..., 466.4    , 472.78   ,
483.4    ],
...,
[593.2    , 592.39996, 592.8    , ..., 590.8    , 590.2    ,
589.39996],
[591.8    , 592.39996, 593.58   , ..., 591.8    , 591.8    ,
590.39996],
[592.58   , 593.58   , 594.2    , ..., 593.8    , 593.58   ,
593.2    ]],

[[484.19998, 485.4    , 486.19998, ..., 464.     , 467.19998,
471.59998],
[487.19998, 488.19998, 488.4    , ..., 462.     , 465.     ,
471.4    ],
[506.4    , 505.78   , 504.19998, ..., 461.59998, 466.78   ,
477.     ],
...
[587.38   , 587.77997, 590.77997, ..., 590.18   , 589.38   ,
588.58   ],
[592.58   , 594.38   , 595.18   , ..., 590.58   , 590.18   ,
588.77997],
[595.58   , 596.77997, 596.98   , ..., 591.38   , 590.98   ,
590.38   ]],

[[490.18   , 488.58   , 486.58   , ..., 483.37997, 482.97998,
483.58   ],
[499.78   , 498.58   , 496.78   , ..., 479.18   , 480.58   ,
483.37997],
[525.98   , 524.38   , 522.77997, ..., 479.78   , 485.18   ,
492.58   ],
...,
[587.58   , 587.38   , 590.18   , ..., 590.58   , 590.18   ,
589.38   ],
[592.18   , 593.77997, 594.38   , ..., 591.38   , 591.38   ,
590.38   ],
[595.38   , 596.18   , 596.18   , ..., 592.98   , 592.38   ,
591.38   ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
```

As you would intuitively expect, the multiplication is applied only to the array values; the axis data is left unchanged.

### Data that can be held outside the array

I mentioned that a DataArray can hold coordinates in addition to the array data. Coordinate data falls into the following categories.

• dimension coordinate
• non-dimension coordinate
• scalar coordinate

#### Dimension coordinate

This is what was shown as a Coordinate in the previous section. It has the same name as a dimension (`time`, `lat`, `lon` in the data above).

#### Non-dimension coordinate

You can also hold coordinates whose names differ from any dimension. I will explain later when this is necessary and convenient; for now, think of it as data that is **carried along but not used in computation**.

#### Scalar coordinate

You can hold a scalar as a coordinate. This is also frequently used; for the time being, think of it, like a non-dimension coordinate, as data that is carried along but not computed on.
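To make the three categories concrete, here is a small synthetic DataArray (not the tutorial data; all names are illustrative) carrying one coordinate of each kind:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=['x', 'y'],
    coords={
        'x': [10, 20],               # dimension coordinate (same name as a dimension)
        'label': ('x', ['a', 'b']),  # non-dimension coordinate, defined along 'x'
        'height': 2.0,               # scalar coordinate, attached to no dimension
    },
)

# Arithmetic acts only on the array values; all three coordinates are carried along.
doubled = da * 2
print(set(doubled.coords))  # {'x', 'label', 'height'}
```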

### Basic methods

It has the basic attributes found on np.ndarray:

• `.shape`: Returns the shape
• `.size`: Returns the total number of elements
• `.ndim`: Returns the number of dimensions

In addition:

• `.sizes`: Returns a mapping from dimension names to sizes (dictionary-like)
• `.dtype`: Returns the dtype of the array
• `.astype()`: Casts the array to another dtype
• `.real`: Returns the real part of the data
• `.imag`: Returns the imaginary part of the data
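These attributes can be checked quickly on a small synthetic array (not the tutorial data):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((3, 4), dtype=np.float32), dims=['x', 'y'])

print(da.shape)  # (3, 4)
print(da.size)   # 12
print(da.ndim)   # 2
print(da.sizes)  # dimension-name -> size mapping
print(da.dtype)  # float32

da64 = da.astype(np.float64)  # returns a copy with the new dtype
print(da64.dtype)             # float64
```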

### Converting from DataArray to np.ndarray

A DataArray wraps an np.ndarray. (Strictly speaking, it can also store multidimensional arrays other than np.ndarray.)

You can access the underlying array with `.data`. (Note that `.values` converts whatever array is stored to an np.ndarray before returning it.)

```
type(data.data), data.shape
```
```
(numpy.ndarray, (2920, 25, 53))
```

### Extract coordinate axis information

Coordinate axis objects can be retrieved by passing the axis name in `[]` like a dictionary.

```
data['lat']
```
```
<xarray.DataArray 'lat' (lat: 25)>
array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. , 47.5,
45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5, 20. , 17.5,
15. ], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Attributes:
standard_name:  latitude
long_name:      Latitude
units:          degrees_north
axis:           Y
```

The retrieved object is also a DataArray.

## xarray.Dataset

A Dataset is an object that bundles multiple DataArrays.

```
data = xr.tutorial.open_dataset('air_temperature')
data['mean_temperature'] = data['air'].mean('time')
data
```
```
<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
* lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air               (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69
mean_temperature  (lat, lon) float32 260.37564 260.1826 ... 297.30502
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
```

In this example, `data` holds two DataArrays, `air` and `mean_temperature`, listed in the `Data variables` section.

• `air` is a three-dimensional array with dimensions (`time`, `lat`, `lon`)
• `mean_temperature` is a two-dimensional array with dimensions (`lat`, `lon`)

The DataArrays held in a Dataset can share axes, so multiple DataArrays can be indexed at the same time, as shown below.

```
data.sel(lat=70, method='nearest')
```
```
<xarray.Dataset>
Dimensions:           (lon: 53, time: 2920)
Coordinates:
lat               float32 70.0
* lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air               (time, lon) float32 250.0 249.79999 ... 242.59 246.29
mean_temperature  (lon) float32 264.7681 264.3271 ... 253.58247 257.71475
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
```

The DataArrays in a Dataset may have different dimensions. Whether axes are shared is determined by their names, so a Dataset cannot hold two distinct axes with the same name.
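A Dataset with shared axes can also be built directly from scratch; a minimal sketch with synthetic variables (all names are invented for illustration):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        'air': (('time', 'lat'), np.arange(12.0).reshape(4, 3)),  # 2-D variable
        'mean_air': (('lat',), np.arange(3.0)),  # 1-D variable sharing 'lat'
    },
    coords={'time': np.arange(4), 'lat': [10.0, 20.0, 30.0]},
)

# Selecting on the shared 'lat' axis indexes both variables at once.
sub = ds.sel(lat=20.0)
print(sub['air'].shape)  # (4,)
```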

To get a DataArray from a Dataset, just pass the name of the DataArray as a key, like a dictionary.

```
da = data['air']
da
```
```
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , ..., 235.5    , 238.59999],
[243.79999, 244.5    , ..., 235.29999, 239.29999],
...,
[295.9    , 296.19998, ..., 295.9    , 295.19998],
[296.29   , 296.79   , ..., 296.79   , 296.6    ]],

[[242.09999, 242.7    , ..., 233.59999, 235.79999],
[243.59999, 244.09999, ..., 232.5    , 235.7    ],
...,
[296.19998, 296.69998, ..., 295.5    , 295.1    ],
[296.29   , 297.19998, ..., 296.4    , 296.6    ]],

...,

[[245.79   , 244.79   , ..., 243.98999, 244.79   ],
[249.89   , 249.29   , ..., 242.48999, 244.29   ],
...,
[296.29   , 297.19   , ..., 295.09   , 294.38998],
[297.79   , 298.38998, ..., 295.49   , 295.19   ]],

[[245.09   , 244.29   , ..., 241.48999, 241.79   ],
[249.89   , 249.29   , ..., 240.29   , 241.68999],
...,
[296.09   , 296.88998, ..., 295.69   , 295.19   ],
[297.69   , 298.09   , ..., 296.19   , 295.69   ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

## Instantiation of DataArray

This section describes how to create a DataArray object.

The easiest way is to pass an `np.ndarray` and the name of each dimension.

```
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'])
```
```
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Dimensions without coordinates: x, y
```

The above example creates a DataArray with 3x4 elements. The first dimension is named `x` and the second `y`.

To attach coordinates, pass the `coords` keyword argument as a dictionary.

```
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
             coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4]})
```
```
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) float64 0.1 0.2 0.3 0.4
```

To include a non-dimension coordinate, you must specify the dimension it depends on. To do so, make the dictionary value (to the right of the colon) a tuple whose first element is the dimension name and whose second element is the array itself.

```
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
             coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4],
                     'z': ('x', ['a', 'b', 'c'])})
```
```
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) float64 0.1 0.2 0.3 0.4
    z        (x) <U1 'a' 'b' 'c'
```

Here, the non-dimension coordinate `z` is defined along the `x` axis.

Pass the scalar coordinate to `coords` as well.

```
xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'],
             coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4], 'scalar': 3})
```
```
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) float64 0.1 0.2 0.3 0.4
    scalar   int64 3
```

You can see that the scalar coordinate does not belong to any dimension.

In addition to axis information, you can also pass attributes and a name for the DataArray. Note, however, that attributes are dropped by many operations, so it may be best to attach only metadata that you want to save to a file.

I will cover this in the later chapter on data structures for storage.
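For example, a name and attributes (here invented metadata) can be attached at construction time:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(3.0),
    dims=['x'],
    name='temperature',
    attrs={'units': 'degK', 'description': 'synthetic example'},
)
print(da.attrs['units'])  # degK

# Most operations drop attrs by default:
print((da * 2).attrs)     # {}
```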

# Indexing: `.isel`, `.sel`, `.interp`, `.reindex`

Indexing is the most basic and essential operation; it is no exaggeration to say that xarray development began in order to simplify it.

xarray supports both position-based and coordinate-based indexing.

## Position-based indexing `.isel`

This is similar to indexing an ordinary array such as np.ndarray: give integers in brackets, as in `data[i, j, k]`. The arguments indicate the **positions of the elements in the array**.

```
da[0, :4, 3]
```
```
<xarray.DataArray 'air' (lat: 4)>
array([244.     , 244.2    , 247.5    , 266.69998], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5
lon      float32 207.5
time     datetime64[ns] 2013-01-01
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

With the above notation, you need to remember the position of each dimension (for example, that the `time` axis comes first).

It is surprisingly hard to keep track of axis positions across analyses. Written as follows, you only have to remember the axis names.

```
da.isel(time=0, lat=slice(0, 4), lon=3)
```
```
<xarray.DataArray 'air' (lat: 4)>
array([244.     , 244.2    , 247.5    , 266.69998], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5
lon      float32 207.5
time     datetime64[ns] 2013-01-01
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

In this way, pass the axis name as a keyword argument to `.isel()` and give the corresponding index after the equals sign.

To specify a range, use a slice object explicitly (`slice(0, 4)`).

## Coordinate-based indexing `.sel`

In actual data analysis, the range of interest is often specified by coordinates. This can be achieved with xarray by using the `.sel` method.

```
da.sel(time='2013-01-01', lat=slice(75, 60))
```
```
<xarray.DataArray 'air' (time: 4, lat: 7, lon: 53)>
array([[[241.2    , 242.5    , ..., 235.5    , 238.59999],
[243.79999, 244.5    , ..., 235.29999, 239.29999],
...,
[272.1    , 270.9    , ..., 275.4    , 274.19998],
[273.69998, 273.6    , ..., 274.19998, 275.1    ]],

[[242.09999, 242.7    , ..., 233.59999, 235.79999],
[243.59999, 244.09999, ..., 232.5    , 235.7    ],
...,
[269.19998, 268.5    , ..., 275.5    , 274.69998],
[272.1    , 272.69998, ..., 275.79   , 276.19998]],

[[242.29999, 242.2    , ..., 236.09999, 238.7    ],
[244.59999, 244.39   , ..., 232.     , 235.7    ],
...,
[273.     , 273.5    , ..., 275.29   , 274.29   ],
[275.5    , 275.9    , ..., 277.4    , 277.6    ]],

[[241.89   , 241.79999, ..., 235.5    , 237.59999],
[246.29999, 245.29999, ..., 231.5    , 234.5    ],
...,
[273.29   , 272.6    , ..., 277.6    , 276.9    ],
[274.1    , 274.     , ..., 279.1    , 279.9    ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2013-01-01T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

As with `.isel`, pass the axis name as a keyword and the desired coordinate value after the equals sign. You can also use `slice` objects; here, the slice bounds may be real-valued coordinates.

However, there are some caveats to the `.sel` method.

### When you want to use approximate coordinate values

The `.sel` method accepts the options `method` and `tolerance`.

The most commonly used `method` is probably `'nearest'`, which indexes using the closest coordinate value.

```
da.sel(lat=76, method='nearest')
```
```
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    , 238.59999],
[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999, 235.79999],
[242.29999, 242.2    , 242.29999, ..., 234.29999, 236.09999, 238.7    ],
...,
[243.48999, 242.98999, 242.09   , ..., 244.18999, 244.48999, 244.89   ],
[245.79   , 244.79   , 243.48999, ..., 243.29   , 243.98999, 244.79   ],
[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999, 241.79   ]],
dtype=float32)
Coordinates:
lat      float32 75.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```
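The `tolerance` option sets an upper bound on how far the nearest match may be; if no coordinate lies within it, the lookup fails. A sketch on a synthetic axis:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims=['lat'], coords={'lat': [10.0, 20.0, 30.0]})

# 21 is within 2 of the coordinate 20.0, so this succeeds:
near = da.sel(lat=21, method='nearest', tolerance=2)
print(float(near['lat']))  # 20.0

# 25 is more than 2 away from every coordinate, so this raises a KeyError:
try:
    da.sel(lat=25, method='nearest', tolerance=2)
except KeyError:
    print('no coordinate within tolerance')
```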

### When axis-based indexing does not work

Coordinate axes can be arbitrary arrays, so coordinate-based indexing does not always work. Typical problem cases are the following.

#### When the coordinate axis is not monotonically increasing or decreasing

You can sort the array along that axis with the `sortby` method, as in `da.sortby('lat')`.
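For example, with a deliberately unsorted synthetic axis:

```python
import xarray as xr

da = xr.DataArray([1, 2, 3], dims=['lat'], coords={'lat': [30.0, 10.0, 20.0]})
sorted_da = da.sortby('lat')
print(sorted_da['lat'].values)  # [10. 20. 30.]
print(sorted_da.values)         # [2 3 1]
```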

#### When the axis contains NaN

Using the `isnull` method, which flags NaN coordinates, you can build an array that skips those positions: `da.isel(lat=~da['lat'].isnull())`.

#### When the axis contains duplicate values

With `da.isel(lat=np.unique(da['lat'], return_index=True)[1])`, you can build an array that skips the positions of duplicated coordinates.
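Both fixes can be checked on a synthetic axis that contains a NaN and a duplicate:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0, 4.0], dims=['lat'],
                  coords={'lat': [10.0, np.nan, 20.0, 20.0]})

# Drop the positions whose coordinate is NaN:
no_nan = da.isel(lat=~da['lat'].isnull())
print(no_nan.values)  # [1. 3. 4.]

# Keep only the first occurrence of each coordinate value:
unique = da.isel(lat=np.unique(da['lat'], return_index=True)[1])
print(len(unique['lat']))  # 3
```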

## Interpolation `.interp`

Interpolation is possible with the same syntax as indexing.

```
da.interp(lat=74)
```
```
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[242.23999321, 243.29999998, 243.9799988 , ..., 232.79998779,
235.41999512, 238.87998962],
[242.69999081, 243.25999451, 243.53999329, ..., 231.60000001,
233.1599945 , 235.75999145],
[243.21998903, 243.07599789, 242.97999266, ..., 232.69998783,
234.45999444, 237.49999702],
...,
[245.72999275, 245.38999009, 244.68999648, ..., 242.74999088,
243.20999146, 244.00999451],
[247.42999566, 246.58999336, 245.48999023, ..., 242.4899933 ,
243.38999027, 244.58999329],
[247.00999749, 246.28999329, 245.32999587, ..., 240.84999084,
241.00999144, 241.74999085]])
Coordinates:
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
lat      int64 74
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

## Conditional indexing `.where`

Sometimes you want to select only the data that meets a more complex condition. In the example above, for instance:

• select only the entries of `time` falling on certain days of the week
• select only the entries where `air` is greater than or equal to some value

When the selection result is one-dimensional, e.g. when selecting certain days of the week, you can pass a one-dimensional boolean array to the `.isel` method.

```
# Choose only weekends
is_weekend = da['time'].dt.dayofweek.isin([5, 6])
is_weekend
```
```
<xarray.DataArray 'dayofweek' (time: 2920)>
array([False, False, False, ..., False, False, False])
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
```
```
# Pass the boolean array to the .isel method
da.isel(time=is_weekend)
```
```
<xarray.DataArray 'air' (time: 832, lat: 25, lon: 53)>
array([[[248.59999, 246.89   , ..., 245.7    , 246.7    ],
[256.69998, 254.59999, ..., 248.     , 250.09999],
...,
[296.69998, 296.     , ..., 295.     , 294.9    ],
[297.     , 296.9    , ..., 296.79   , 296.69998]],

[[245.09999, 243.59999, ..., 249.89   , 251.39   ],
[254.     , 251.79999, ..., 247.79999, 250.89   ],
...,
[296.79   , 296.19998, ..., 294.6    , 294.69998],
[297.19998, 296.69998, ..., 296.29   , 296.6    ]],

...,

[[242.39   , 241.79999, ..., 247.59999, 247.29999],
[247.18999, 246.09999, ..., 253.29999, 254.29999],
...,
[294.79   , 295.29   , ..., 297.69998, 297.6    ],
[296.38998, 297.19998, ..., 298.19998, 298.29   ]],

[[249.18999, 248.     , ..., 244.09999, 242.5    ],
[254.5    , 253.     , ..., 251.     , 250.59999],
...,
[294.88998, 295.19998, ..., 298.     , 297.29   ],
[296.1    , 296.88998, ..., 298.5    , 298.29   ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-05 ... 2014-12-28T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

When the selection is based on the values of a multidimensional array, the shape of the result is not well defined, so it is better to replace the unselected elements with NaN. For that, use the `where` method.

```
da.where(da > 270)
```
```
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
...,
[296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
294.69998],
[295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
295.19998],
[296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
296.6    ]],

[[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
...
[293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
294.29   ],
[296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
294.38998],
[297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
295.19   ]],

[[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
...,
[293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
294.69   ],
[296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
295.19   ],
[297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
295.69   ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

Here, the elements that do not satisfy `da > 270` have been replaced with `np.nan`.
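If you want to discard the masked elements entirely rather than keep NaN placeholders, `where` also accepts `drop=True`, which removes positions where the condition is False everywhere; a 1-D sketch:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([260.0, 275.0, 280.0], dims=['x'], coords={'x': [0, 1, 2]})

masked = da.where(da > 270)              # shape kept, NaN inserted
dropped = da.where(da > 270, drop=True)  # fully masked positions removed
print(dropped.values)  # [275. 280.]
```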

## Advanced indexing

So far we have covered basic indexing, but xarray also supports more advanced forms.

As the figure illustrates, you can cut a differently-shaped array out of a multidimensional array. The official documentation calls this *advanced indexing*.

To achieve this, define the indexers passed as arguments as DataArrays, giving them the dimensions that the newly created array should have.

In the figure above, the new dimension is `z`, so we define the indexer DataArrays along `z`.

```
lon_new = xr.DataArray([60, 61, 62], dims=['z'], coords={'z': ['a', 'b', 'c']})
lat_new = xr.DataArray([16, 46, 76], dims=['z'], coords={'z': ['a', 'b', 'c']})
```

If you pass these as arguments to `.sel`, it selects, for each `z`, the point whose `(lon, lat)` is closest to `(60, 16)`, `(61, 46)`, and `(62, 76)`, respectively.

```
da.sel(lon=lon_new, lat=lat_new, method='nearest')
```
```
<xarray.DataArray 'air' (time: 2920, z: 3)>
array([[296.29   , 280.     , 241.2    ],
[296.29   , 279.19998, 242.09999],
[296.4    , 278.6    , 242.29999],
...,
[298.19   , 279.99   , 243.48999],
[297.79   , 279.69   , 245.79   ],
[297.69   , 279.79   , 245.09   ]], dtype=float32)
Coordinates:
lat      (z) float32 15.0 45.0 75.0
lon      (z) float32 200.0 200.0 200.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
* z        (z) <U1 'a' 'b' 'c'
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

You can see that the resulting array no longer depends on the `lon` and `lat` axes, but on `z`.

A similar method can also be used for interpolation. (In this example the result is all NaN, because the requested `lon` values 60 to 62 lie outside the data's `lon` range of 200.0 to 330.0.)

```
da.interp(lon=lon_new, lat=lat_new)
```
```
<xarray.DataArray 'air' (time: 2920, z: 3)>
array([[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],
...,
[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan]])
Coordinates:
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
lon      (z) int64 60 61 62
lat      (z) int64 16 46 76
* z        (z) <U1 'a' 'b' 'c'
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
```

# Combine data `xr.concat`

You will often want to combine multiple xarray objects, for example when jointly analyzing experimental and simulation data obtained under various parameters.

These are the counterparts of `np.concatenate` and `np.stack`. It may help to think of them as the inverse of indexing.

The syntax is:

```
xr.concat([data1, data2, ...], dim='dim')
```

Pass multiple DataArrays (or Datasets) as the first argument and the dimension to join along as `dim`. Depending on the join dimension, the operation behaves like either concatenate or stack.
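When the name passed as `dim` does not exist yet, the arrays are stacked along a brand-new dimension (the np.stack-like behavior); a sketch with tiny synthetic arrays:

```python
import xarray as xr

da0 = xr.DataArray([1.0, 2.0], dims=['x'], coords={'x': [0, 1]})
da1 = xr.DataArray([3.0, 4.0], dims=['x'], coords={'x': [0, 1]})

# 'run' is not an existing dimension, so a new axis of length 2 is created.
stacked = xr.concat([da0, da1], dim='run')
print(stacked.dims)   # ('run', 'x')
print(stacked.shape)  # (2, 2)
```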

### Specify an existing axis (concatenate behavior)

The basic operation of the `xr.concat` function is to connect multiple DataArrays along the existing axis.

For example, split the data into the following two DataArrays:

```
da0 = da.isel(time=slice(0, 10))
da1 = da.isel(time=slice(10, None))
da0
```
```
<xarray.DataArray 'air' (time: 10, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , ..., 235.5    , 238.59999],
[243.79999, 244.5    , ..., 235.29999, 239.29999],
...,
[295.9    , 296.19998, ..., 295.9    , 295.19998],
[296.29   , 296.79   , ..., 296.79   , 296.6    ]],

[[242.09999, 242.7    , ..., 233.59999, 235.79999],
[243.59999, 244.09999, ..., 232.5    , 235.7    ],
...,
[296.19998, 296.69998, ..., 295.5    , 295.1    ],
[296.29   , 297.19998, ..., 296.4    , 296.6    ]],

...,

[[244.79999, 244.39   , ..., 242.7    , 244.79999],
[246.7    , 247.09999, ..., 237.79999, 240.2    ],
...,
[297.79   , 297.19998, ..., 296.4    , 295.29   ],
[297.9    , 297.69998, ..., 297.19998, 297.     ]],

[[243.89   , 243.79999, ..., 240.29999, 242.59999],
[245.5    , 245.79999, ..., 236.59999, 239.     ],
...,
[297.6    , 297.     , ..., 295.6    , 295.     ],
[298.1    , 297.69998, ..., 296.79   , 297.1    ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2013-01-03T06:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]```

You can restore the original data by connecting in the `time` direction.

``````xr.concat([da0, da1], dim='time')
``````
```<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
238.59999],
[243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
239.29999],
[250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
241.7    ],
...,
[296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
294.69998],
[295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
295.19998],
[296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
296.6    ]],

[[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
235.7    ],
[253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
238.5    ],
...
[293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
294.29   ],
[296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
294.38998],
[297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
295.19   ]],

[[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
241.79   ],
[249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
241.68999],
[262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
246.29   ],
...,
[293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
294.69   ],
[296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
295.19   ],
[297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
295.69   ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]```

Note that the `time` coordinate has also been restored.
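The round trip can be verified on a small synthetic array (a minimal sketch; the array and dimension names here are made up for illustration):

```python
import numpy as np
import xarray as xr

# A small stand-in for the tutorial data.
da = xr.DataArray(
    np.arange(12.0).reshape(4, 3), dims=['time', 'x'],
    coords={'time': np.arange(4), 'x': np.arange(3)})

da0 = da.isel(time=slice(0, 2))
da1 = da.isel(time=slice(2, None))

# Concatenating the pieces along the existing `time` dimension
# reproduces the original, coordinates included.
restored = xr.concat([da0, da1], dim='time')
print(restored.identical(da))  # True
```

`identical` compares values, coordinates, and attributes, so `True` confirms that nothing was lost in the split/concat round trip.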

### Specify a new axis (stack operation)

For example, suppose you have two DataArrays like this:

``````da0 = da.isel(time=0)
da1 = da.isel(time=1)
da0
``````
```<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    , 238.59999],
[243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999, 239.29999],
[250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   , 241.7    ],
...,
[296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    , 294.69998],
[295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    , 295.19998],
[296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   , 296.6    ]],
dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
time     datetime64[ns] 2013-01-01
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]```

To combine these along the `time` axis, do the following:

``````xr.concat([da0, da1], dim='time')
``````
```<xarray.DataArray 'air' (time: 2, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
238.59999],
[243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
239.29999],
[250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
241.7    ],
...,
[296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
294.69998],
[295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
295.19998],
[296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
296.6    ]],

[[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
235.7    ],
[253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
238.5    ],
...,
[296.4    , 295.9    , 296.19998, ..., 295.4    , 295.1    ,
294.79   ],
[296.19998, 296.69998, 296.79   , ..., 295.6    , 295.5    ,
295.1    ],
[296.29   , 297.19998, 297.4    , ..., 296.4    , 296.4    ,
296.6    ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]```

In this way, if `da0` and `da1` each have a scalar coordinate named `time`, specifying that name with the `dim` keyword (`dim='time'`) stacks them along a new dimension and promotes `time` to a dimension coordinate.

Alternatively, you can pass a DataArray as the `dim` keyword to create a new coordinate axis with that name and those values.
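As a minimal sketch (the arrays and the `trial` name are made up for illustration), passing a DataArray as `dim` supplies both the new dimension name and its coordinate values:

```python
import numpy as np
import xarray as xr

da0 = xr.DataArray(np.zeros(3), dims=['x'], coords={'x': [0, 1, 2]})
da1 = xr.DataArray(np.ones(3), dims=['x'], coords={'x': [0, 1, 2]})

# The DataArray passed as `dim` names the new dimension ('trial')
# and provides its coordinate values (10, 20).
trial = xr.DataArray([10, 20], dims='trial', name='trial')
combined = xr.concat([da0, da1], dim=trial)
print(combined.dims)             # ('trial', 'x')
print(combined['trial'].values)  # [10 20]
```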

# File IO

Another advantage of xarray is that DataArray and Dataset objects can be stored consistently and self-contained. Several file formats are supported, but the easiest to use is probably netCDF.

Wikipedia also has an explanation.

NetCDF is a machine-independent binary format that reads and writes data as arrays (array-oriented) and can store, alongside the data, descriptions of that data (self-descriptive). NetCDF is an international standard of the Open Geospatial Consortium, an international consortium that develops open geographic information standards.

Versions 3 and 4 are currently maintained, but it is safer to use version 4. Version 4 is built on the HDF5 standard, so an ordinary HDF5 reader can also read the contents.

The official extension is `.nc`.

## Save / load files in netCDF format

The `netcdf4` package is required to save in netCDF format.

``````pip install netcdf4
``````

Or

``````conda install netcdf4
``````

Run one of these to install it on your system.

To save an xarray object in netCDF format, call the `.to_netcdf` method.

``````data.to_netcdf('test.nc')
``````
``````/home/keisukefujii/miniconda3/envs/xarray/lib/python3.7/site-packages/ipykernel_launcher.py:1: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
"""Entry point for launching an IPython kernel.
``````

You should now have `test.nc` saved in your current path.

Conversely, to load a saved file, use `xr.load_dataset`.

``````xr.load_dataset('test.nc')
``````
```<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
* lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air               (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69
mean_temperature  (lat, lon) float32 260.37564 260.1826 ... 297.30502
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...```

As you can see, not only the array itself but also the axis names, coordinate values, and the information in `attributes` are stored.

### A note about `scipy.io.netcdf`

xarray's `.to_netcdf` uses the `netcdf4` library by default, but falls back to `scipy.io.netcdf` if `netcdf4` is not installed on your system. However, `scipy.io.netcdf` only supports version 3, which is incompatible with version 4 and can cause confusion.

Don't forget to install `netcdf4` on your system. Alternatively, you can specify the engine explicitly, e.g. `.to_netcdf('test.nc', engine='netcdf4')`.

## Out-of-memory array

Files in netCDF format support random access. Therefore, if you only need part of the data in a file, it is more efficient to read just that part when it is needed, rather than reading the whole file from disk up front. This is especially important when dealing with files that are too large to fit in memory. This is called out-of-memory processing.

Out-of-memory processing is positioned as a fairly important feature in xarray, and various implementations are in progress. I'll cover the details in another article and only mention the most basic usage here.

Out-of-memory processing is performed by calling `open_dataset` instead of `load_dataset` when loading.

``````unloaded_data = xr.open_dataset('test.nc')
unloaded_data
``````
```<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
* lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air               (time, lat, lon) float32 ...
mean_temperature  (lat, lon) float32 ...
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...```

If you look at `Data variables`, you can see that, for example, `air` contains no numbers, just `...`.

This indicates that the data has not been read yet. However, since the coordinate axes are always loaded, the indexing operations described above work as-is.

``````unloaded_data['air'].sel(lat=60, method='nearest')
``````
```<xarray.DataArray 'air' (time: 2920, lon: 53)>
[154760 values with dtype=float32]
Coordinates:
lat      float32 60.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]```

Selecting a variable or indexing it still does not load the data. It is loaded into memory only when it is operated on, or when `.data` or `.values` is accessed.

You can also use the `.compute()` method to load it explicitly.

``````unloaded_data.compute()
``````
```<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
* lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air               (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69
mean_temperature  (lat, lon) float32 260.37564 260.1826 ... 297.30502
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...```

### Note

`open_dataset` and `open_dataarray` keep the file open (and locked) for future reading. Therefore, touching a file with the same name from another program can cause an error.

For example, consider a time-consuming calculation that writes its results to a `.nc` file as it goes. If you open that file with `open_dataset` to check the progress, subsequent attempts to write to it will fail. It would be better if the write simply failed and stopped, but it often corrupts the existing data instead. Use out-of-memory processing with this caution in mind.

You can use the `with` statement to ensure that the file is closed and unlocked, as shown below.

``````with xr.open_dataset('test.nc') as f:
    print(f)
``````
``````<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
* lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
* time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air               (time, lat, lon) float32 ...
mean_temperature  (lat, lon) float32 ...
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
``````

# Recommended data structure

As mentioned above, saving data in netCDF format makes subsequent analysis with xarray smooth. When creating data from experiments or simulations, it is a good idea to design the data structure with the subsequent analysis in mind and save it in netCDF format.

The following practices are recommended:

• Save the coordinate axis data
• Store the experiment time and scanned parameters as scalar coordinates rather than in `attributes`
• Put information that does not change between experiments in `attributes` (such as details of the equipment used and a description of the data)

For an experiment, I recommend saving with a structure like the following.

``````from datetime import datetime
``````
``````# Let's assume that this is image data taken with a camera.
raw_data = np.arange(512 * 2048).reshape(512, 2048)
# The x-axis and y-axis coordinates of the image.
x = np.arange(512)
y = np.arange(2048)

# Position data for each pixel of the image
r = 300.0
X = r * np.cos(x / 512)[:, np.newaxis] + r * np.sin(y / 2048)
Y = r * np.sin(x / 512)[:, np.newaxis] - r * np.cos(y / 2048)

# Parameters used in the experiment
p0 = 3.0
p1 = 'normal'

# Experiment time
now = datetime.now()

data = xr.DataArray(
    raw_data, dims=['x', 'y'],
    coords={'x': x, 'y': y,
            'X': (('x', 'y'), X),
            'Y': (('x', 'y'), Y),
            'p0': p0, 'p1': p1,
            'datetime': now},
    attrs={'camera type': 'flash 4',
           'about': 'Example of the recommended data structure.'})
data
``````
```<xarray.DataArray (x: 512, y: 2048)>
array([[      0,       1,       2, ...,    2045,    2046,    2047],
[   2048,    2049,    2050, ...,    4093,    4094,    4095],
[   4096,    4097,    4098, ...,    6141,    6142,    6143],
...,
[1042432, 1042433, 1042434, ..., 1044477, 1044478, 1044479],
[1044480, 1044481, 1044482, ..., 1046525, 1046526, 1046527],
[1046528, 1046529, 1046530, ..., 1048573, 1048574, 1048575]])
Coordinates:
* x         (x) int64 0 1 2 3 4 5 6 7 8 ... 504 505 506 507 508 509 510 511
* y         (y) int64 0 1 2 3 4 5 6 7 ... 2041 2042 2043 2044 2045 2046 2047
X         (x, y) float64 300.0 300.1 300.3 300.4 ... 414.7 414.8 414.9 414.9
Y         (x, y) float64 -300.0 -300.0 -300.0 -300.0 ... 89.66 89.79 89.91
p0        float64 3.0
p1        <U6 'normal'
datetime  datetime64[ns] 2020-11-13T03:31:56.219372
Attributes:
camera type:  flash 4
about:        Example of the recommended data structure.```

By collecting the quantities that can change between experiments into scalar coordinates, it becomes easy to analyze their dependencies later.
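For example, here is a minimal sketch (the `run_experiment` helper and its parameter are hypothetical) of how runs tagged with a scalar coordinate can later be combined and selected along that parameter:

```python
import numpy as np
import xarray as xr

def run_experiment(p0):
    # Hypothetical helper: one fake measurement tagged with
    # its parameter p0 as a scalar coordinate.
    return xr.DataArray(
        np.full(4, p0), dims=['x'],
        coords={'x': np.arange(4), 'p0': p0})

# Because every run carries `p0` as a scalar coordinate,
# concat promotes it to a dimension coordinate.
runs = xr.concat([run_experiment(p) for p in [1.0, 2.0, 3.0]], dim='p0')
print(runs.dims)                # ('p0', 'x')
print(runs.sel(p0=2.0).values)  # [2. 2. 2. 2.]
```

With `p0` as a dimension coordinate, the dependence on the parameter can be explored directly with `.sel`, `.plot`, reductions, and so on.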

# Other useful functions / methods

## Reduction: `.sum`, `.mean`, etc.

Reduction operations such as `np.sum` and `np.mean` are provided as methods. Just as `np.sum` lets you specify along which axis to sum, xarray lets you specify the axis by its dimension name.

``````data = xr.tutorial.open_dataset('air_temperature')['air']

data.sum('lat')
``````
```<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[6984.9497, 6991.6606, 6991.5303, ..., 6998.77  , 7007.8804,
7016.5605],
[6976.4307, 6988.45  , 6993.2407, ..., 6994.3906, 7006.7505,
7019.941 ],
[6975.2603, 6982.02  , 6988.77  , ..., 6992.0503, 7004.9404,
7020.3506],
...,
[6990.7505, 6998.3496, 7013.3496, ..., 6995.05  , 7008.6504,
7019.4497],
[6984.95  , 6991.6504, 7007.949 , ..., 6994.15  , 7008.55  ,
7020.8506],
[6981.75  , 6983.85  , 6997.0503, ..., 6985.6494, 6999.2495,
7012.0493]], dtype=float32)
Coordinates:
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00```

If you sum along an axis in this way, that axis disappears from the result, while the other axes remain.

Note that `.sum` etc. call `np.nansum` instead of `np.sum` by default, so NaN values are skipped. If you do not want to skip NaN (i.e. you want `np.sum` behavior), pass `skipna=False`.
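A small illustration of the default NaN handling (synthetic data made up for this example):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([[1.0, np.nan], [3.0, 4.0]], dims=['x', 'y'])

# By default NaN is skipped, as with np.nansum.
print(da.sum('y').values)                # [1. 7.]
# With skipna=False, NaN propagates, as with np.sum.
print(da.sum('y', skipna=False).values)  # [nan  7.]
```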

## Drawing `.plot`

`DataArray` has a `plot` method, which makes it easy to visualize the contents of the data.

``````data.sel(time='2014-08-30 08:00', method='nearest').plot()
``````
``````<matplotlib.collections.QuadMesh at 0x7f9a98557310>
``````

Think of it as a quick look rather than a way to produce final figures for a paper or report; it is very convenient for visualizing and understanding data in Jupyter, IPython, etc. while proceeding with the analysis.
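As a sketch (the synthetic data and output filename here are made up for illustration), `.plot()` dispatches on dimensionality, so a 2D array becomes a pcolormesh with axes labeled from the dimension names; in a script you can save the result with matplotlib:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Synthetic 2D data standing in for one time slice of the tutorial data.
da = xr.DataArray(
    np.random.rand(25, 53), dims=['lat', 'lon'],
    coords={'lat': np.linspace(15, 75, 25),
            'lon': np.linspace(200, 330, 53)})

da.plot()  # 2D -> pcolormesh, axes labeled 'lon' and 'lat'
plt.savefig('air_snapshot.png')
```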