[PYTHON] Data analysis using xarray


Previous article

In this article, I take a more practical look at data analysis with xarray, which the article above introduced.

About the author

I have been a development member of pydata/xarray since 2017. These days I am mostly a user, but I still send the occasional bug-fix PR.

I use xarray for almost all of the data analysis in my day job.

About the intended reader

It is assumed that you have some knowledge of numpy (that is, you can create and operate on np.ndarray).

Structure of this article

In this article, I will revisit the data model of xarray and then cover the basic operation of indexing. It is fair to say that indexing is the essence of xarray.

After that, I will introduce concat, which is the reverse operation of indexing.

Building on these, I will then discuss what kind of data structures make analysis with xarray easy when you create your own data.

I will also talk a little about saving and loading data and out-of-memory processing using dask.

This article was written using Google Colaboratory. Anyone can copy it from here, run it, and follow along.

xarray data model

Overview

In a nutshell, xarray is **a multidimensional array with axes**. You can think of it as a combination of numpy's np.ndarray and pandas' pd.Series.

Once you get used to it, it's very convenient because you don't have to think about coordinates during data analysis.

On the other hand, it is often said that the usage is a little unique and the learning cost is high. This article aims to reduce the learning cost as much as possible.

There are two main classes in xarray. One is xarray.DataArray and the other is xarray.Dataset. I will introduce them.

This is almost the same as the content described in Multidimensional data analysis library xarray.

xarray.DataArray

This section introduces xarray.DataArray, the **multidimensional array with axes**. I will explain how to construct one later; first, let's look at the data in xarray.tutorial.

import numpy as np
import xarray as xr
data = xr.tutorial.open_dataset('air_temperature')['air']
data
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Here, data is a 3-dimensional xr.DataArray instance. (xr is the official abbreviation for xarray.)

Displaying data shows a summary like the one above. The (time: 2920, lat: 25, lon: 53) in the top row gives the dimension names of the stored array and their sizes.

As you can see, xarray **refers to each dimension by name** (these names are called *labels* in the official documentation).

The first axis is time, the second is lat, and the third is lon. Because the axes are named, you do not have to worry about their order.

In addition, coordinate values are attached to each axis; they are listed in the Coordinates section. In this data, all three axes (time, lat, lon) have coordinates.

You can also attach other metadata, shown in the Attributes section.

As described in detail in the indexing section, having coordinate axes makes it intuitive to extract data in a desired range. For example, to get the data for August 30, 2014:

data.sel(time='2014-08-30')
<xarray.DataArray 'air' (time: 4, lat: 25, lon: 53)>
array([[[270.29   , 270.29   , ..., 258.9    , 259.5    ],
        [273.     , 273.1    , ..., 262.6    , 264.19998],
        ...,
        [299.5    , 298.79   , ..., 299.     , 298.5    ],
        [299.29   , 299.29   , ..., 299.9    , 300.     ]],

       [[269.9    , 270.1    , ..., 258.6    , 259.5    ],
        [273.     , 273.29   , ..., 262.5    , 265.     ],
        ...,
        [299.19998, 298.9    , ..., 298.5    , 298.     ],
        [299.4    , 299.69998, ..., 299.19998, 299.6    ]],

       [[270.4    , 270.69998, ..., 261.29   , 261.9    ],
        [273.     , 273.6    , ..., 266.4    , 268.6    ],
        ...,
        [297.9    , 297.6    , ..., 298.29   , 298.29   ],
        [298.69998, 298.69998, ..., 299.1    , 299.4    ]],

       [[270.     , 270.4    , ..., 266.     , 266.19998],
        [273.     , 273.9    , ..., 268.1    , 269.69998],
        ...,
        [298.5    , 298.29   , ..., 298.69998, 299.     ],
        [299.1    , 299.19998, ..., 299.5    , 299.69998]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2014-08-30 ... 2014-08-30T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

As shown, the .sel method selects data using the coordinate axes: .sel(time='2014-08-30') looks at the time axis and selects the entries whose coordinate is '2014-08-30'.

Operations on DataArray

A DataArray is a multidimensional array just like np.ndarray, so you can perform arithmetic on it exactly as you would on np.ndarray.

data * 2
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[482.4    , 485.     , 487.     , ..., 465.59998, 471.     ,
         477.19998],
        [487.59998, 489.     , 489.4    , ..., 465.59998, 470.59998,
         478.59998],
        [500.     , 499.59998, 497.78   , ..., 466.4    , 472.78   ,
         483.4    ],
        ...,
        [593.2    , 592.39996, 592.8    , ..., 590.8    , 590.2    ,
         589.39996],
        [591.8    , 592.39996, 593.58   , ..., 591.8    , 591.8    ,
         590.39996],
        [592.58   , 593.58   , 594.2    , ..., 593.8    , 593.58   ,
         593.2    ]],

       [[484.19998, 485.4    , 486.19998, ..., 464.     , 467.19998,
         471.59998],
        [487.19998, 488.19998, 488.4    , ..., 462.     , 465.     ,
         471.4    ],
        [506.4    , 505.78   , 504.19998, ..., 461.59998, 466.78   ,
         477.     ],
...
        [587.38   , 587.77997, 590.77997, ..., 590.18   , 589.38   ,
         588.58   ],
        [592.58   , 594.38   , 595.18   , ..., 590.58   , 590.18   ,
         588.77997],
        [595.58   , 596.77997, 596.98   , ..., 591.38   , 590.98   ,
         590.38   ]],

       [[490.18   , 488.58   , 486.58   , ..., 483.37997, 482.97998,
         483.58   ],
        [499.78   , 498.58   , 496.78   , ..., 479.18   , 480.58   ,
         483.37997],
        [525.98   , 524.38   , 522.77997, ..., 479.78   , 485.18   ,
         492.58   ],
        ...,
        [587.58   , 587.38   , 590.18   , ..., 590.58   , 590.18   ,
         589.38   ],
        [592.18   , 593.77997, 594.38   , ..., 591.38   , 591.38   ,
         590.38   ],
        [595.38   , 596.18   , 596.18   , ..., 592.98   , 592.38   ,
         591.38   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

As you would intuitively expect, the multiplication is applied only to the array values; the coordinate data is left unchanged.
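This behavior can be checked on a small hand-made DataArray (a minimal sketch with made-up coordinates; construction is explained in a later section):

```python
import numpy as np
import xarray as xr

# A small DataArray with a coordinate on each axis
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=['x', 'y'],
    coords={'x': [10, 20], 'y': [0.1, 0.2, 0.3]},
)

# Arithmetic acts on the array values only
doubled = da * 2
```

The values of `doubled` are twice those of `da`, while `doubled['x']` and `doubled['y']` are identical to the original coordinates.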

Data that can be held outside the array

I mentioned that you can hold Coordinates in addition to array data. Coordinate data can be categorized as follows.

Dimension coordinate

This is what was shown under Coordinates in the previous section. A dimension coordinate has the same name as a dimension (time, lon, and lat in the data above).

Non-dimension coordinate

You can also hold coordinates whose names differ from any dimension. I will explain later in which situations these are necessary and convenient; for now, understand that they are **carried along but not used in computation**.

Scalar coordinate

You can also hold a scalar as a coordinate. This is another frequently used kind of data; for the time being, think of it, like non-dimension coordinates, as data that is carried along but not computed on.

Basic methods

DataArray has the basic methods found on np.ndarray, such as .mean(), .sum(), and .transpose(), among others.
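A convenient difference from np.ndarray is that reductions accept dimension names instead of axis numbers (a small sketch on synthetic data):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=['x', 'y'],
    coords={'x': [10, 20], 'y': [0.1, 0.2, 0.3]},
)

# Reduce over a named dimension instead of a positional axis
mean_over_x = da.mean('x')   # the 'x' dimension disappears
total = da.sum()             # 0-dimensional DataArray
```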

A note about types

Converting from DataArray to np.ndarray

A DataArray wraps an np.ndarray. (Strictly speaking, it can also wrap multidimensional arrays other than np.ndarray.)

You can access the underlying array via .data. (Note that .values instead converts whatever array is inside to np.ndarray and returns that.)

type(data.data), data.shape
(numpy.ndarray, (2920, 25, 53))

Extract coordinate axis information

Coordinate axis objects can be retrieved by passing the axis name in [] like a dictionary.

data['lat']
<xarray.DataArray 'lat' (lat: 25)>
array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. , 47.5,
       45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5, 20. , 17.5,
       15. ], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Attributes:
    standard_name:  latitude
    long_name:      Latitude
    units:          degrees_north
    axis:           Y

The retrieved one is also a DataArray object.

xarray.Dataset

A Dataset is an object that collects multiple DataArrays.

data = xr.tutorial.open_dataset('air_temperature')
data['mean_temperature'] = data['air'].mean('time')
data
<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
  * time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air               (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69
    mean_temperature  (lat, lon) float32 260.37564 260.1826 ... 297.30502
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

In this example, data holds two DataArrays, air and mean_temperature. They are listed in the Data variables section.

The DataArrays held by a Dataset can share axes, which makes it possible to index multiple DataArrays at the same time, as shown below.

data.sel(lat=70, method='nearest')
<xarray.Dataset>
Dimensions:           (lon: 53, time: 2920)
Coordinates:
    lat               float32 70.0
  * lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
  * time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air               (time, lon) float32 250.0 249.79999 ... 242.59 246.29
    mean_temperature  (lon) float32 264.7681 264.3271 ... 253.58247 257.71475
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

The DataArrays in a Dataset may have different dimensions. Whether axes are shared is determined by the axis name; consequently, a Dataset cannot hold two distinct, unshared axes with the same name.
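As a sketch, a Dataset whose variables have different dimensions but share an axis can also be built directly with the xr.Dataset constructor (the variable names here are made up, not the tutorial data):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        'temperature': (('x', 'y'), np.arange(6.0).reshape(2, 3)),
        'humidity': ('x', [0.3, 0.5]),   # different dims, shared 'x'
    },
    coords={'x': [10, 20], 'y': [0.1, 0.2, 0.3]},
)

# Selecting on the shared axis slices both variables at once
sub = ds.sel(x=10)
```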

To get a DataArray from a Dataset, just pass the name of the DataArray as a key, like a dictionary.

da = data['air']
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , ..., 235.5    , 238.59999],
        [243.79999, 244.5    , ..., 235.29999, 239.29999],
        ...,
        [295.9    , 296.19998, ..., 295.9    , 295.19998],
        [296.29   , 296.79   , ..., 296.79   , 296.6    ]],

       [[242.09999, 242.7    , ..., 233.59999, 235.79999],
        [243.59999, 244.09999, ..., 232.5    , 235.7    ],
        ...,
        [296.19998, 296.69998, ..., 295.5    , 295.1    ],
        [296.29   , 297.19998, ..., 296.4    , 296.6    ]],

       ...,

       [[245.79   , 244.79   , ..., 243.98999, 244.79   ],
        [249.89   , 249.29   , ..., 242.48999, 244.29   ],
        ...,
        [296.29   , 297.19   , ..., 295.09   , 294.38998],
        [297.79   , 298.38998, ..., 295.49   , 295.19   ]],

       [[245.09   , 244.29   , ..., 241.48999, 241.79   ],
        [249.89   , 249.29   , ..., 240.29   , 241.68999],
        ...,
        [296.09   , 296.88998, ..., 295.69   , 295.19   ],
        [297.69   , 298.09   , ..., 296.19   , 295.69   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Instantiation of DataArray

This section describes how to create a DataArray object.

The easiest way is to give np.ndarray and each axis name.

xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'])
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Dimensions without coordinates: x, y

The above example creates a DataArray with 3x4 elements. The name of the first dimension is x and the second dimension is y.

To give the axes, specify the coords keyword as a dictionary.

xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'], 
             coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4]})
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) float64 0.1 0.2 0.3 0.4

To include a non-dimension coordinate, you must specify which dimension it depends on. To do this, make the dictionary value (to the right of the colon) a tuple whose first element is the name of the dimension it depends on and whose second element is the array itself.

xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'], 
             coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4], 
                     'z': ('x', ['a', 'b', 'c'])})
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) float64 0.1 0.2 0.3 0.4
    z        (x) <U1 'a' 'b' 'c'

Where the non-dimension coordinate z is defined on the x axis.

Pass the scalar coordinate to coords as well.

xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'], 
             coords={'x': [0, 1, 2], 'y': [0.1, 0.2, 0.3, 0.4], 'scalar': 3})
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) float64 0.1 0.2 0.3 0.4
    scalar   int64 3

You can see that the scalar coordinate does not belong to any dimension.

In addition to axis information, you can also pass attributes and a name for the DataArray. Note, however, that attributes are dropped by most operations, so it may be better to attach the things you want to keep only just before saving to a file.

I will come back to this in the later chapter on data structures for storage.
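A minimal illustration of attributes being dropped by an operation:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.ones(3), dims=['x'], attrs={'units': 'degK'})

# Arithmetic drops the attributes by default
result = da * 2
```

`result.attrs` comes back empty even though `da.attrs` still holds `{'units': 'degK'}`. (If you do want attributes propagated, xr.set_options(keep_attrs=True) changes this default.)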

Indexing .isel, .sel, .interp, .reindex

Indexing is the most basic and essential operation; it is no exaggeration to say that xarray development began in order to simplify it.

xarray supports both position-based indexing and coordinate-based indexing.

Position-based indexing .isel

This works like indexing on ordinary arrays such as np.ndarray: give integers inside brackets, as in data[i, j, k]. The arguments indicate the **position** of the elements in the array.

da[0, :4, 3]
<xarray.DataArray 'air' (lat: 4)>
array([244.     , 244.2    , 247.5    , 266.69998], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5
    lon      float32 207.5
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

With the notation above, you need to remember the position of each dimension (for example, that the time axis comes first).

That turns out to be surprisingly hard to keep straight across analyses. Written as follows, you only need to remember the axis names.

da.isel(time=0, lat=slice(0, 4), lon=3)
<xarray.DataArray 'air' (lat: 4)>
array([244.     , 244.2    , 247.5    , 266.69998], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5
    lon      float32 207.5
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

In this way, pass the axis names as keyword arguments to the .isel() method, with the corresponding index after the equals sign.

Note that you need to use slice objects explicitly (slice(0, 4)) when specifying a range.

Coordinate-based indexing .sel

In actual data analysis, the range of interest is often specified by coordinates. This can be achieved with xarray by using the .sel method.

da.sel(time='2013-01-01', lat=slice(75, 60))
<xarray.DataArray 'air' (time: 4, lat: 7, lon: 53)>
array([[[241.2    , 242.5    , ..., 235.5    , 238.59999],
        [243.79999, 244.5    , ..., 235.29999, 239.29999],
        ...,
        [272.1    , 270.9    , ..., 275.4    , 274.19998],
        [273.69998, 273.6    , ..., 274.19998, 275.1    ]],

       [[242.09999, 242.7    , ..., 233.59999, 235.79999],
        [243.59999, 244.09999, ..., 232.5    , 235.7    ],
        ...,
        [269.19998, 268.5    , ..., 275.5    , 274.69998],
        [272.1    , 272.69998, ..., 275.79   , 276.19998]],

       [[242.29999, 242.2    , ..., 236.09999, 238.7    ],
        [244.59999, 244.39   , ..., 232.     , 235.7    ],
        ...,
        [273.     , 273.5    , ..., 275.29   , 274.29   ],
        [275.5    , 275.9    , ..., 277.4    , 277.6    ]],

       [[241.89   , 241.79999, ..., 235.5    , 237.59999],
        [246.29999, 245.29999, ..., 231.5    , 234.5    ],
        ...,
        [273.29   , 272.6    , ..., 277.6    , 276.9    ],
        [274.1    , 274.     , ..., 279.1    , 279.9    ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2013-01-01T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

As with .isel, give the axis name as a keyword and the desired coordinate value after the equals sign. Slice objects work here too, and in this case their arguments can be real numbers.

However, there are some caveats to the .sel method.

When you want to use approximate coordinate values

The .sel method takes two options, method and tolerance.

The most commonly used is method='nearest', which indexes using the closest coordinate value.

da.sel(lat=76, method='nearest')
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    , 238.59999],
       [242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999, 235.79999],
       [242.29999, 242.2    , 242.29999, ..., 234.29999, 236.09999, 238.7    ],
       ...,
       [243.48999, 242.98999, 242.09   , ..., 244.18999, 244.48999, 244.89   ],
       [245.79   , 244.79   , 243.48999, ..., 243.29   , 243.98999, 244.79   ],
       [245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999, 241.79   ]],
      dtype=float32)
Coordinates:
    lat      float32 75.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

When coordinate-based indexing does not work

Coordinate axes have a lot of freedom, so coordinate-based indexing does not always work. Typical cases include the following.

When the axis is not monotonically increasing or decreasing

You can sort along that axis with the sortby method, e.g. da.sortby('lat').

When the axis contains NaN

The isnull method returns True where a value is NaN, so da.isel(lat=~da['lat'].isnull()) builds an array that skips the coordinates that are NaN.

When the axis contains duplicate values

da.isel(lat=np.unique(da['lat'], return_index=True)[1]) builds an array that skips the entries corresponding to duplicated coordinate values.
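The three fixes above can be combined on a deliberately messy coordinate axis (a sketch on synthetic data, not the tutorial set):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0],
    dims=['lat'],
    coords={'lat': [30.0, 10.0, np.nan, 10.0]},  # unsorted, NaN, duplicate
)

# 1. Drop entries whose coordinate is NaN
no_nan = da.isel(lat=~da['lat'].isnull())

# 2. Keep only the first occurrence of each duplicated coordinate value
unique = no_nan.isel(lat=np.unique(no_nan['lat'], return_index=True)[1])

# 3. Sort along the axis so range selection with .sel works
clean = unique.sortby('lat')
```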

Interpolation .interp

Interpolation is possible with the same syntax as indexing.

da.interp(lat=74)
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[242.23999321, 243.29999998, 243.9799988 , ..., 232.79998779,
        235.41999512, 238.87998962],
       [242.69999081, 243.25999451, 243.53999329, ..., 231.60000001,
        233.1599945 , 235.75999145],
       [243.21998903, 243.07599789, 242.97999266, ..., 232.69998783,
        234.45999444, 237.49999702],
       ...,
       [245.72999275, 245.38999009, 244.68999648, ..., 242.74999088,
        243.20999146, 244.00999451],
       [247.42999566, 246.58999336, 245.48999023, ..., 242.4899933 ,
        243.38999027, 244.58999329],
       [247.00999749, 246.28999329, 245.32999587, ..., 240.84999084,
        241.00999144, 241.74999085]])
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    lat      int64 74
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
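The section heading also lists .reindex, which has not appeared yet: it conforms the array to a new set of coordinate labels without interpolating, so labels missing from the original axis are filled with NaN. A minimal sketch on synthetic data:

```python
import numpy as np
import xarray as xr

da_small = xr.DataArray(
    [10.0, 20.0, 30.0],
    dims=['lat'],
    coords={'lat': [10.0, 20.0, 30.0]},
)

# Exact label matches keep their values; missing labels become NaN
conformed = da_small.reindex(lat=[10.0, 15.0, 30.0])
```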

Conditional indexing xr.where

There are times when you want to select only the data that meets a more complex condition: for example, only the weekend data, or only the values above a certain threshold.

When the selection result can remain one-dimensional (for example, selecting certain days of the week), this can be done by passing a one-dimensional bool array to the .isel method.

import datetime

# Choose only weekends
is_weekend = da['time'].dt.dayofweek.isin([5, 6])
is_weekend
<xarray.DataArray 'dayofweek' (time: 2920)>
array([False, False, False, ..., False, False, False])
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
# Pass the bool-typed array to the .isel method
da.isel(time=is_weekend)
<xarray.DataArray 'air' (time: 832, lat: 25, lon: 53)>
array([[[248.59999, 246.89   , ..., 245.7    , 246.7    ],
        [256.69998, 254.59999, ..., 248.     , 250.09999],
        ...,
        [296.69998, 296.     , ..., 295.     , 294.9    ],
        [297.     , 296.9    , ..., 296.79   , 296.69998]],

       [[245.09999, 243.59999, ..., 249.89   , 251.39   ],
        [254.     , 251.79999, ..., 247.79999, 250.89   ],
        ...,
        [296.79   , 296.19998, ..., 294.6    , 294.69998],
        [297.19998, 296.69998, ..., 296.29   , 296.6    ]],

       ...,

       [[242.39   , 241.79999, ..., 247.59999, 247.29999],
        [247.18999, 246.09999, ..., 253.29999, 254.29999],
        ...,
        [294.79   , 295.29   , ..., 297.69998, 297.6    ],
        [296.38998, 297.19998, ..., 298.19998, 298.29   ]],

       [[249.18999, 248.     , ..., 244.09999, 242.5    ],
        [254.5    , 253.     , ..., 251.     , 250.59999],
        ...,
        [294.88998, 295.19998, ..., 298.     , 297.29   ],
        [296.1    , 296.88998, ..., 298.5    , 298.29   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-05 ... 2014-12-28T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

When you want to select based on the values of a multidimensional array, the shape of the result cannot be defined in advance, so it is better to replace the unselected elements with NaN. For this you can use the where method.

da.where(da > 270)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        [      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        [      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        [      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        [      nan,       nan,       nan, ...,       nan,       nan,
               nan],
...
        [293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
         294.29   ],
        [296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
         294.38998],
        [297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
         295.19   ]],

       [[      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        [      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        [      nan,       nan,       nan, ...,       nan,       nan,
               nan],
        ...,
        [293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
         294.69   ],
        [296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
         295.19   ],
        [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
         295.69   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Here, the elements that do not satisfy da > 270 have been replaced with np.nan.
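The section heading mentions xr.where: whereas da.where(cond) masks non-matching elements with NaN, the three-argument form xr.where(cond, x, y) picks from x where the condition holds and from y elsewhere. A small sketch on synthetic data:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([260.0, 275.0, 280.0, 265.0], dims=['t'])

# da.where(cond): keep values where cond is True, NaN elsewhere
masked = da.where(da > 270)

# xr.where(cond, x, y): choose between two values elementwise
clipped = xr.where(da > 270, da, 270.0)
```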

Advanced indexing

So far we have introduced basic indexing, but xarray also allows for advanced indexing.

You can cut another array out of the multidimensional array by selecting it pointwise. The official documentation calls this advanced indexing.


To achieve this, define the arrays passed as arguments as DataArrays, and give them the dimensions you want the newly created array to have.

Here the new dimension is z, so we define the indexer DataArrays along z.

lon_new = xr.DataArray([60, 61, 62], dims=['z'], coords={'z': ['a', 'b', 'c']})
lat_new = xr.DataArray([16, 46, 76], dims=['z'], coords={'z': ['a', 'b', 'c']})

If you pass these as arguments to the .sel method, the points closest to (lon, lat) = (60, 16), (61, 46), (62, 76) will be selected.

da.sel(lon=lon_new, lat=lat_new, method='nearest')
<xarray.DataArray 'air' (time: 2920, z: 3)>
array([[296.29   , 280.     , 241.2    ],
       [296.29   , 279.19998, 242.09999],
       [296.4    , 278.6    , 242.29999],
       ...,
       [298.19   , 279.99   , 243.48999],
       [297.79   , 279.69   , 245.79   ],
       [297.69   , 279.79   , 245.09   ]], dtype=float32)
Coordinates:
    lat      (z) float32 15.0 45.0 75.0
    lon      (z) float32 200.0 200.0 200.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
  * z        (z) <U1 'a' 'b' 'c'
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

You can see that the newly obtained array no longer depends on the lon, lat axes, but on the z.
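The mechanics can be verified on a small synthetic array: the indexer DataArrays share the new dimension z, and the result is defined on z instead of the original dimensions (a sketch):

```python
import numpy as np
import xarray as xr

da2 = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=['x', 'y'],
    coords={'x': [0, 1, 2], 'y': [0, 10, 20, 30]},
)

# Pointwise (advanced) indexing: pick (x, y) pairs along a new dimension z
x_new = xr.DataArray([0, 2], dims=['z'])
y_new = xr.DataArray([10, 30], dims=['z'])
picked = da2.sel(x=x_new, y=y_new)   # elements (x=0, y=10) and (x=2, y=30)
```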

A similar method can also be used for interpolation.

da.interp(lon=lon_new, lat=lat_new)
<xarray.DataArray 'air' (time: 2920, z: 3)>
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       ...,
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
    lon      (z) int64 60 61 62
    lat      (z) int64 16 46 76
  * z        (z) <U1 'a' 'b' 'c'
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Combine data xr.concat

Often you want to combine multiple xarray objects. For example, when you want to analyze experimental data and simulation data obtained with various parameters together.

This is the equivalent of np.concatenate, np.stack. I think it's easy to understand if you think of it as the reverse operation of indexing.

The syntax is

xr.concat([data1, data2, ...], dim='dim')

Pass a list of DataArrays (or Datasets) as the first argument. The second argument specifies the direction of the join. Depending on that direction, the behavior corresponds to either a concatenate or a stack operation.
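As a minimal sketch (using hypothetical toy arrays rather than the tutorial data), joining two DataArrays that share a dimension works like np.concatenate:

```python
import numpy as np
import xarray as xr

# Two toy DataArrays that share the dimension 'x' but cover
# different coordinate ranges (hypothetical example data).
a = xr.DataArray(np.arange(3), dims=['x'], coords={'x': [0, 1, 2]})
b = xr.DataArray(np.arange(3, 6), dims=['x'], coords={'x': [3, 4, 5]})

# Joining along the existing axis concatenates both the values
# and the coordinates.
joined = xr.concat([a, b], dim='x')
print(joined.sizes['x'])       # 6
print(joined.sel(x=4).item())  # 4
```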

Specify an existing axis (concatenate behavior)

The basic operation of the xr.concat function is to connect multiple DataArrays along the existing axis.

For example, take the following two DataArrays:

da0 = da.isel(time=slice(0, 10))
da1 = da.isel(time=slice(10, None))
da0
<xarray.DataArray 'air' (time: 10, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , ..., 235.5    , 238.59999],
        [243.79999, 244.5    , ..., 235.29999, 239.29999],
        ...,
        [295.9    , 296.19998, ..., 295.9    , 295.19998],
        [296.29   , 296.79   , ..., 296.79   , 296.6    ]],

       [[242.09999, 242.7    , ..., 233.59999, 235.79999],
        [243.59999, 244.09999, ..., 232.5    , 235.7    ],
        ...,
        [296.19998, 296.69998, ..., 295.5    , 295.1    ],
        [296.29   , 297.19998, ..., 296.4    , 296.6    ]],

       ...,

       [[244.79999, 244.39   , ..., 242.7    , 244.79999],
        [246.7    , 247.09999, ..., 237.79999, 240.2    ],
        ...,
        [297.79   , 297.19998, ..., 296.4    , 295.29   ],
        [297.9    , 297.69998, ..., 297.19998, 297.     ]],

       [[243.89   , 243.79999, ..., 240.29999, 242.59999],
        [245.5    , 245.79999, ..., 236.59999, 239.     ],
        ...,
        [297.6    , 297.     , ..., 295.6    , 295.     ],
        [298.1    , 297.69998, ..., 296.79   , 297.1    ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2013-01-03T06:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

You can restore the original data by connecting in the time direction.

xr.concat([da0, da1], dim='time')
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
...
        [293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
         294.29   ],
        [296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
         294.38998],
        [297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
         295.19   ]],

       [[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
         241.79   ],
        [249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
         241.68999],
        [262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
         246.29   ],
        ...,
        [293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
         294.69   ],
        [296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
         295.19   ],
        [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
         295.69   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Note that the time axis has been restored as well.

Specify a new axis (stack operation)

For example, suppose you have two DataArrays like this:

da0 = da.isel(time=0)
da1 = da.isel(time=1)
da0
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    , 238.59999],
       [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999, 239.29999],
       [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   , 241.7    ],
       ...,
       [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    , 294.69998],
       [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    , 295.19998],
       [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   , 296.6    ]],
      dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

To combine this along the time axis, do the following:

xr.concat([da0, da1], dim='time')
<xarray.DataArray 'air' (time: 2, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
        ...,
        [296.4    , 295.9    , 296.19998, ..., 295.4    , 295.1    ,
         294.79   ],
        [296.19998, 296.69998, 296.79   , ..., 295.6    , 295.5    ,
         295.1    ],
        [296.29   , 297.19998, 297.4    , ..., 296.4    , 296.4    ,
         296.6    ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

In this way, if da0 and da1 each carry a scalar coordinate named time, specifying dim='time' connects them in that direction and promotes time to a new dimension coordinate.

Alternatively, you can pass a DataArray as the dim keyword; its name becomes a new coordinate axis along which the inputs are stacked.
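For instance (a toy sketch with a made-up axis name 'run'), passing a DataArray as dim stacks the inputs along a brand-new axis whose coordinate values come from that DataArray:

```python
import numpy as np
import xarray as xr

# Two snapshots with no shared 'run' dimension (hypothetical data).
a = xr.DataArray(np.zeros((2, 2)), dims=['x', 'y'])
b = xr.DataArray(np.ones((2, 2)), dims=['x', 'y'])

# A DataArray passed as dim supplies both the new axis name and
# its coordinate values.
run = xr.DataArray(['exp1', 'exp2'], dims='run', name='run')
stacked = xr.concat([a, b], dim=run)
print(stacked.sizes['run'])                   # 2
print(stacked.sel(run='exp2').mean().item())  # 1.0
```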

File IO

Another advantage of xarray is that DataArray and Dataset objects can be saved in a consistent, self-contained form. Several file formats are supported, but the easiest to use is probably netCDF.

Wikipedia also has an explanation.

NetCDF is a machine-independent binary format that reads and writes data as arrays (array-oriented) and that can store, alongside the data itself, descriptions of that data (self-describing). NetCDF is an international standard of the Open Geospatial Consortium, an international consortium that develops open geospatial information standards.

Versions 3 and 4 are both currently maintained, but it is safer to use version 4. Version 4 conforms to the HDF5 standard, so the contents can be read even by an ordinary HDF5 reader.

The official extension is .nc.

Save / load files in netCDF format

The netcdf4 package is required to save in netCDF format.

pip install netcdf4

Or

conda install netcdf4

Run one of these to install it on your system.

To save the xarray object in netCDF format, execute the .to_netcdf method.

data.to_netcdf('test.nc')
/home/keisukefujii/miniconda3/envs/xarray/lib/python3.7/site-packages/ipykernel_launcher.py:1: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs
  """Entry point for launching an IPython kernel.

You should now have test.nc saved in your current directory.

Conversely, to load the saved file, use xr.load_dataset.

xr.load_dataset('test.nc')
<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
  * time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air               (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69
    mean_temperature  (lat, lon) float32 260.37564 260.1826 ... 297.30502
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

As you can see, not only the array itself but also the axis names, coordinate values, and the information in attributes are restored.
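The round trip can be sketched as follows (a small toy Dataset written to a temporary file; all names here are made up):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# A small toy Dataset with a coordinate and an attribute.
ds = xr.Dataset(
    {'air': (('t',), np.array([1.0, 2.0, 3.0]))},
    coords={'t': np.array([0, 1, 2], dtype='int32')},
    attrs={'title': 'roundtrip example'})

path = os.path.join(tempfile.mkdtemp(), 'roundtrip.nc')
ds.to_netcdf(path)              # save
loaded = xr.load_dataset(path)  # load back into memory

# Data, coordinates, and attributes all survive the round trip.
print(loaded.identical(ds))  # True
```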

A note about scipy.io.netcdf

xarray's .to_netcdf uses the netcdf4 library by default, but falls back to scipy.io.netcdf if netcdf4 is not installed on your system. However, scipy.io.netcdf only supports version 3 of the format, which is incompatible with version 4 and can cause confusion.

Don't forget to install netcdf4 on your system. Alternatively, you can specify the backend explicitly, e.g. .to_netcdf('test.nc', engine='netcdf4').

Out-of-memory array

Files in netCDF format support random access. So if you only need part of the data in a file, it is more efficient to read just that part from disk when it is needed, rather than reading the whole file up front. This is especially important when dealing with files that are too large to fit in memory. Such deferred reading is called out-of-memory processing.

Out-of-memory processing is positioned as a fairly important feature of xarray, and various implementations are in progress. I'll cover the details in another article and only mention the most basic usage here.

Out-of-memory processing is performed by calling open_dataset instead of load_dataset when loading.

unloaded_data = xr.open_dataset('test.nc')
unloaded_data
<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
  * time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air               (time, lat, lon) float32 ...
    mean_temperature  (lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

If you look at Data variables, you can see that air contains no numbers, just ....

This indicates that the data have not been read yet. The coordinate axes, however, are always loaded, so the indexing operations described above work as usual.

unloaded_data['air'].sel(lat=60, method='nearest')
<xarray.DataArray 'air' (time: 2920, lon: 53)>
[154760 values with dtype=float32]
Coordinates:
    lat      float32 60.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Selecting a variable or indexing it still does not trigger a load. The data is read into memory only when an operation is performed on it or when .data or .values is accessed.
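A minimal sketch of this lazy behavior (the file and variable names here are made up for illustration):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Prepare a small file to open lazily.
path = os.path.join(tempfile.mkdtemp(), 'lazy.nc')
xr.Dataset({'v': (('x',), np.arange(5.0))}).to_netcdf(path)

with xr.open_dataset(path) as ds:
    sub = ds['v'].isel(x=slice(0, 2))  # still lazy: nothing read yet
    values = sub.values                # the actual read happens here

print(values.tolist())  # [0.0, 1.0]
```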

You can also load it explicitly with the .compute() method.

unloaded_data.compute()
<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
  * time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air               (time, lat, lon) float32 241.2 242.5 ... 296.19 295.69
    mean_temperature  (lat, lon) float32 260.37564 260.1826 ... 297.30502
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Note

open_dataset and open_dataarray keep the file open (and locked) for later reads. If another program then touches a file with the same name, an error can occur.

For example, consider a long-running computation that writes its results to a .nc file as it goes. If you open that file with open_dataset to check the progress, subsequent attempts to write to it will fail. Failing and stopping would be the lesser evil, but in my experience it can even corrupt the existing data. Use this out-of-memory processing with caution.

You can use the with statement to ensure that the file is closed and unlocked, as shown below.

with xr.open_dataset('test.nc') as f:
  print(f)
<xarray.Dataset>
Dimensions:           (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat               (lat) float32 75.0 72.5 70.0 67.5 ... 22.5 20.0 17.5 15.0
  * lon               (lon) float32 200.0 202.5 205.0 ... 325.0 327.5 330.0
  * time              (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air               (time, lat, lon) float32 ...
    mean_temperature  (lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Recommended data structure

As mentioned above, if you save your data in netCDF format, you can analyze it smoothly with xarray. When producing data from experiments or simulations, it is a good idea to design the data structure with the subsequent analysis in mind and save it in netCDF format accordingly.

The key is to record not just the raw array but also its axes, derived coordinates, experimental parameters, and timestamps.

For an experiment, I think data should be saved with a structure like the following:

from datetime import datetime

# Assume this is image data taken with a camera.
raw_data = np.arange(512 * 2048).reshape(512, 2048)
# The x-axis and y-axis coordinates of the image.
x = np.arange(512)
y = np.arange(2048)

# Position data for each pixel of the image.
r = 300.0
X = r * np.cos(x / 512)[:, np.newaxis] + r * np.sin(y / 2048)
Y = r * np.sin(x / 512)[:, np.newaxis] - r * np.cos(y / 2048)

# Parameters used in the experiment.
p0 = 3.0
p1 = 'normal'

# Experiment time.
now = datetime.now()

data = xr.DataArray(
    raw_data, dims=['x', 'y'],
    coords={'x': x, 'y': y, 
            'X': (('x', 'y'), X),
            'Y': (('x', 'y'), Y),
            'p0': p0, 'p1': p1,
            'datetime': now},
    attrs={'camera type': 'flash 4', 
           'about': 'Example of the recommended data structure.'})
data
<xarray.DataArray (x: 512, y: 2048)>
array([[      0,       1,       2, ...,    2045,    2046,    2047],
       [   2048,    2049,    2050, ...,    4093,    4094,    4095],
       [   4096,    4097,    4098, ...,    6141,    6142,    6143],
       ...,
       [1042432, 1042433, 1042434, ..., 1044477, 1044478, 1044479],
       [1044480, 1044481, 1044482, ..., 1046525, 1046526, 1046527],
       [1046528, 1046529, 1046530, ..., 1048573, 1048574, 1048575]])
Coordinates:
  * x         (x) int64 0 1 2 3 4 5 6 7 8 ... 504 505 506 507 508 509 510 511
  * y         (y) int64 0 1 2 3 4 5 6 7 ... 2041 2042 2043 2044 2045 2046 2047
    X         (x, y) float64 300.0 300.1 300.3 300.4 ... 414.7 414.8 414.9 414.9
    Y         (x, y) float64 -300.0 -300.0 -300.0 -300.0 ... 89.66 89.79 89.91
    p0        float64 3.0
    p1        <U6 'normal'
    datetime  datetime64[ns] 2020-11-13T03:31:56.219372
Attributes:
    camera type:  flash 4
    about:        Example of the recommended data structure.

By collecting the things that can change during the experiment in scalar coordinates, it will be easier to analyze later when you want to know the dependencies on them.
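As a sketch of why this helps (hypothetical runs, with a parameter name p0 as above), scalar coordinates attached to individual runs can later be promoted to an axis with xr.concat and used directly for selection:

```python
import numpy as np
import xarray as xr

def make_run(p0):
    # One hypothetical measurement, tagged with its parameter value
    # as a scalar coordinate.
    return xr.DataArray(np.full(4, p0), dims=['x'],
                        coords={'x': np.arange(4), 'p0': p0})

# Concatenating along 'p0' turns the scalar coordinate into an axis,
# so the parameter dependence can be analyzed like any other dimension.
runs = xr.concat([make_run(1.0), make_run(2.0)], dim='p0')
print(runs.sizes['p0'])                # 2
print(runs.sel(p0=2.0).mean().item())  # 2.0
```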

Other useful functions / methods

Reduction .sum, .mean, etc

Reduction operations such as sum and mean are provided as methods. Just as np.sum lets you specify the axis along which to add, xarray lets you specify the axis by its name.

data = xr.tutorial.open_dataset('air_temperature')['air']

data.sum('lat')
<xarray.DataArray 'air' (time: 2920, lon: 53)>
array([[6984.9497, 6991.6606, 6991.5303, ..., 6998.77  , 7007.8804,
        7016.5605],
       [6976.4307, 6988.45  , 6993.2407, ..., 6994.3906, 7006.7505,
        7019.941 ],
       [6975.2603, 6982.02  , 6988.77  , ..., 6992.0503, 7004.9404,
        7020.3506],
       ...,
       [6990.7505, 6998.3496, 7013.3496, ..., 6995.05  , 7008.6504,
        7019.4497],
       [6984.95  , 6991.6504, 7007.949 , ..., 6994.15  , 7008.55  ,
        7020.8506],
       [6981.75  , 6983.85  , 6997.0503, ..., 6985.6494, 6999.2495,
        7012.0493]], dtype=float32)
Coordinates:
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

If you sum along an axis in this way, that axis disappears from the result; the other axes remain.

Note that .sum and similar methods call np.nansum rather than np.sum by default, so the calculation skips NaN values. If you do not want to skip NaN (i.e. you want np.sum behavior rather than np.nansum), pass skipna=False.
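A minimal illustration of the skipna default:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 3.0], dims=['x'])

# By default NaN is skipped, as with np.nansum.
print(da.sum().item())  # 4.0
# With skipna=False the NaN propagates, as with np.sum.
print(np.isnan(da.sum(skipna=False).item()))  # True
```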

Drawing .plot

DataArray has a plot method, which makes it easy to visualize the contents of the data.

data.sel(time='2014-08-30 08:00', method='nearest').plot()
<matplotlib.collections.QuadMesh at 0x7f9a98557310>

(Figure: 2D color plot of the air temperature over lon/lat at the selected time.)

Think of it as a quick-look tool rather than a way to produce final figures for a paper or report; it is very convenient for visualizing and understanding data in Jupyter, IPython, etc. as you work through an analysis.
