[PYTHON] How to read time series data in PyTorch

Overview

In this article I summarize how to feed sequence data into PyTorch. There are surely points I have missed, so I would appreciate any technical feedback. What you will learn here is how to read a dataset in PyTorch as fixed-length chunks of video. In particular, I assume datasets such as the UCSD dataset, which are stored not as video files but as sequentially numbered images in each folder.

DATASET/
 ├ train/
 │ ├ img_0001.png ← 1st frame of video
 │ ├ img_0002.png ← 2nd frame of video
 │ ├ img_0003.png
 │ :
 └ test/

I wanted to train an LSTM with unsupervised learning in PyTorch, but as far as I could find there was no module for loading video, so I reluctantly decided to write one myself.

**The plan is to first read the dataset as images, build fixed-length videos (partial time series) from it, group them into batches, and train the LSTM on them.**

1. How to read datasets in PyTorch

PyTorch provides the Dataset and DataLoader classes for reading training data. They serve up the data found in the directory given when the dataset object is created, one batch at a time in every epoch, which is very convenient during training. The following three components are involved in reading data.

--transforms: a module in charge of data preprocessing

--Dataset: a module that returns a pair of data and its corresponding label. When returning data, it applies transforms and returns the preprocessed result.

--DataLoader: a module that returns data from a Dataset in batches of the batch size

Typically, you set up the dataset's preprocessing (standardization, resizing, and so on) in transforms, use Dataset to pair the data with labels and apply the preprocessing, and finally use DataLoader to return the data in chunks of the batch size. However, this only works when the samples in the dataset are i.i.d.; it becomes a problem when you want to feed in **sequence data**. I wanted a module that can handle sequence data, especially video, so I worked one out.
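The standard flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `ToyImageDataset` and `scale_to_unit` are hypothetical names, and random tensors stand in for real images.

```python
import torch
from torch.utils.data import Dataset, DataLoader


def scale_to_unit(img):
    """Hypothetical preprocessing step: map pixel values from [0, 255] to [0, 1]."""
    return img / 255.0


class ToyImageDataset(Dataset):
    """Hypothetical dataset: returns (preprocessed image, label) pairs."""

    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        img = self.images[index]
        if self.transform is not None:
            img = self.transform(img)  # preprocessing is applied on access
        return img, self.labels[index]


# Random stand-ins for 20 RGB images of size 8x8 with binary labels.
images = torch.randint(0, 256, (20, 3, 8, 8)).float()
labels = torch.randint(0, 2, (20,))

dataset = ToyImageDataset(images, labels, transform=scale_to_unit)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_imgs, batch_labels = next(iter(loader))
print(batch_imgs.shape)    # torch.Size([4, 3, 8, 8])
print(batch_labels.shape)  # torch.Size([4])
```

With shuffle=True each batch mixes unrelated images. That is fine for i.i.d. data, but it discards the frame order that a sequence model needs.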

2. Inheritance / extension of Dataset class

Since the base is the Dataset class, we inherit from it and declare a subclass (Seq_Dataset, SD below) with Dataset (Ds below) as the parent class (superclass). Only the methods you want to change are written again in SD. (Methods you do not define are inherited from the parent automatically.) Basically, when you extend the Dataset class you write __len__ and __getitem__. In particular, __getitem__ is where you describe the processing (here, conversion to video) applied to the Dataset object that was read (image data this time).

The flow assumed this time is: **set up preprocessing with transforms → read and preprocess the image data with ImageFolder (a Dataset) → finally, have Seq_Dataset build fixed-length videos (partial time series), which are then returned in batches of the batch size**.
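As a small illustration of subclassing Dataset, the sketch below writes only __len__ and __getitem__ and relies on the parent class for everything else. `SquaresDataset` is a hypothetical toy class, not part of the module built in this article.

```python
import torch
from torch.utils.data import Dataset, ConcatDataset


class SquaresDataset(Dataset):
    """Hypothetical subclass: only __len__ and __getitem__ are written here."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return torch.tensor(index ** 2)


ds = SquaresDataset(5)
print(len(ds))       # 5
print(ds[3].item())  # 9

# __add__ is not defined above, so the version inherited from Dataset is used.
combined = ds + SquaresDataset(2)
print(isinstance(combined, ConcatDataset))  # True
print(len(combined))                        # 7
```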

Below is the SD class that extends Ds this time. I will briefly explain each function.

dataset.py


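import torch
from torch.utils.data import Dataset


class Seq_Dataset(Dataset):
    def __init__(self, datasets, time_steps):
        self.dataset = datasets        # image dataset yielding (image, label) pairs
        self.time_steps = time_steps   # number of frames per clip
        # Read the frame shape from the first image: (channels, height, width).
        channels, height, width = self.dataset[0][0].shape
        # Zero-initialized buffer that holds one fixed-length clip.
        self.video = torch.zeros((time_steps, channels, height, width))

    def __len__(self):
        # Number of clip start positions handed out by this dataset.
        return len(self.dataset) - self.time_steps

    def __getitem__(self, index):
        # Copy time_steps consecutive frames, starting at index, into the buffer.
        for i in range(self.time_steps):
            self.video[i] = self.dataset[index + i][0]
        # No real label is defined (unsupervised use): pass the first item through.
        img_label = self.dataset[index]
        # Return a copy; otherwise all samples in a batch would share one buffer.
        return self.video.clone(), img_label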

__init__ simply defines the necessary variables. The fixed length, time_steps, is taken as an argument. The variable video is a tensor that holds one fixed-length partial time series and is initialized with zeros. The image data from the datasets argument is written into it.

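__len__ just returns the total number of samples. Because the images that were read are ultimately returned as fixed-length videos, the total is len(dataset) - time_steps. (Strictly speaking there are len(dataset) - time_steps + 1 possible windows, so this count leaves out the last one.)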
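The window arithmetic behind this count can be checked with plain Python. The helper `window_starts` and the toy values below are hypothetical and chosen only for illustration.

```python
def window_starts(num_frames, time_steps):
    """Start indices of every full window of `time_steps` consecutive frames."""
    return list(range(num_frames - time_steps + 1))


num_frames, time_steps = 8, 3  # toy values chosen for illustration

starts = window_starts(num_frames, time_steps)
print(starts)       # [0, 1, 2, 3, 4, 5]
print(len(starts))  # 6, i.e. num_frames - time_steps + 1

# __len__ in Seq_Dataset returns num_frames - time_steps, so indices 0..4 are used.
used = starts[: num_frames - time_steps]
print(used)                       # [0, 1, 2, 3, 4]
print(used[-1] + time_steps - 1)  # 6: last frame index read, inside 0..7
```

Every index that __getitem__ touches therefore stays inside the dataset, at the cost of never using the final window.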

__getitem__ builds a partial time series of time_steps frames, assigns it to video, and returns it. This is also where you can write label operations for the images. Because my use case is unsupervised learning, I took the crude shortcut of not specifying a label at all and simply passing the dataset item through as it is. There are many reference examples elsewhere for how to specify labels. (Sorry for cutting corners here.)

3. Example of using Seq_Dataset

In actual training you iterate over the data_loader object with a for loop to train the model. To get data_loader, define each variable as below and follow the flow ImageFolder → Seq_Dataset → DataLoader.

main.py


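from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from dataset import Seq_Dataset  # the class defined in dataset.py

data_dir = "./data/train/"  # ImageFolder expects the images inside subfolders of this directory
time_steps = 10
num_workers = 4
transform = transforms.ToTensor()  # assumed minimal preprocessing; replace with your own

# opt is assumed to hold parsed command-line options defined elsewhere.
image_dataset = datasets.ImageFolder(data_dir, transform=transform)
data_ = Seq_Dataset(image_dataset, time_steps)
data_loader = DataLoader(data_, batch_size=opt.batch_size, shuffle=True, num_workers=num_workers)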
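The same flow can be tried without any image files. The sketch below is an assumption-laden stand-in, not the module itself: `WindowDataset` is a hypothetical simplified counterpart of Seq_Dataset that returns only the clip, and a TensorDataset of random frames replaces ImageFolder.

```python
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset


class WindowDataset(Dataset):
    """Hypothetical stand-in for Seq_Dataset: stacks consecutive frames into a clip."""

    def __init__(self, frames, time_steps):
        self.frames = frames
        self.time_steps = time_steps

    def __len__(self):
        return len(self.frames) - self.time_steps

    def __getitem__(self, index):
        clip = [self.frames[index + i][0] for i in range(self.time_steps)]
        return torch.stack(clip)  # (time_steps, channels, height, width)


# 50 random grayscale 16x16 frames with dummy labels, in place of ImageFolder.
frames = TensorDataset(torch.rand(50, 1, 16, 16), torch.zeros(50, dtype=torch.long))

clips = WindowDataset(frames, time_steps=10)
loader = DataLoader(clips, batch_size=4, shuffle=True)

batch = next(iter(loader))
print(len(clips))   # 40
print(batch.shape)  # torch.Size([4, 10, 1, 16, 16])
```

Note that shuffle=True only shuffles which clip start positions are drawn; the frame order inside each clip is preserved.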

The partial time series tensor that finally comes out has the shape [batch_size, time_steps, channels, img_size, img_size]. In the future, I would like to publish an LSTM implementation in PyTorch that uses this home-made module. Thank you for reading to the end.
