[PYTHON] Data supply tricks using deques in machine learning

When a machine learning task involves a large number of image files, a common approach is to list the file names and read the images sequentially during training. However, the number of prepared samples is rarely an exact multiple of the mini-batch size, so a remainder is left over at the end of each pass through the data, and handling it tends to get complicated.

For example, if the number of data samples is num = 100 and the mini-batch size is batch_size = 30, the following options are conceivable:

  1. Run the mini-batch 3 times and do not use the remaining 10 samples.
  2. Shrink the next mini-batch to fit the remainder (batch_size = 10).
  3. Keep the next mini-batch at full size and fill the shortfall (20 samples) by reusing samples that have already been used.

If the number of samples is large, option 1 is fine, but if you want to use every piece of training data, you will want option 2 or 3.
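The numbers in the three options can be checked with a few lines of arithmetic. This is only an illustrative sketch: num and batch_size come from the example above, while full_batches, leftover and shortfall are names introduced here.

```python
num = 100         # number of data samples
batch_size = 30   # mini-batch size

# number of complete mini-batches, and the samples left over after them
full_batches, leftover = divmod(num, batch_size)

# samples that option 3 has to reuse to fill the next mini-batch
# (the modulo keeps this at 0 when num is an exact multiple of batch_size)
shortfall = (batch_size - leftover) % batch_size

print(full_batches, leftover, shortfall)   # 3 10 20
```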

Here, we will implement the method of option 3 with deque.

What is a deque

The following explanation is quoted from the book "Introduction Python3" (see References).

A deque (pronounced "deck") is a double-ended queue, which has the features of both a stack and a queue. It is useful when you want to be able to add or remove elements at either end of a sequence.

This explanation is illustrated below.

[Figure: deque_image.png]

This time, deque.rotate() is used to implement the reuse of data samples. (The process is "use some data, rotate, use some data, rotate, ...".)
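As a quick reminder of how deque.rotate() behaves, here is a minimal sketch on a small deque of integers (the sample values are arbitrary and not part of the original data). A positive argument rotates to the right, and a negative argument rotates to the left.

```python
from collections import deque

d = deque([0, 1, 2, 3, 4])

d.rotate(1)     # positive n: rotate n steps to the right
print(d)        # deque([4, 0, 1, 2, 3])

d.rotate(-2)    # negative n: rotate n steps to the left
print(d)        # deque([1, 2, 3, 4, 0])
```

Rotating to the left by the number of samples just used moves those samples to the end, so the next unused samples come to the front.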

Implementation

Suppose the data files are laid out as follows.

$ ls deque_ex/*.jpg
deque_ex/img_0.jpg  deque_ex/img_3.jpg  deque_ex/img_6.jpg  deque_ex/img_9.jpg
deque_ex/img_1.jpg  deque_ex/img_4.jpg  deque_ex/img_7.jpg
deque_ex/img_2.jpg  deque_ex/img_5.jpg  deque_ex/img_8.jpg

First, make a list (deque) of the file names to be handled.

import glob
from collections import deque
import numpy as np

def mk_list():
    fname_list = glob.glob('*.jpg')
    sorted_fn = sorted(fname_list)
    deq_fname = deque()
    deq_fname.extend(sorted_fn)   # 'extend' adds each name as an element;
                                  # 'append' would add the whole list as one.
    
    return deq_fname

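The key point is to use deque.extend() instead of deque.append() when adding the list to the deque: extend() adds each file name as a separate element, whereas append() would add the whole list as a single element. Calling mk_list() returns the following.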

>>>
deque(['img_0.jpg',
       'img_1.jpg',
       'img_2.jpg',
       'img_3.jpg',
       'img_4.jpg',
       'img_5.jpg',
       'img_6.jpg',
       'img_7.jpg',
       'img_8.jpg',
       'img_9.jpg'])
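The difference between extend() and append() mentioned above can be seen in a small sketch. The three file names and the variable names dq_extend and dq_append are made up for illustration.

```python
from collections import deque

names = ['img_0.jpg', 'img_1.jpg', 'img_2.jpg']

dq_extend = deque()
dq_extend.extend(names)    # each file name becomes one element
print(len(dq_extend))      # 3

dq_append = deque()
dq_append.append(names)    # the whole list becomes a single element
print(len(dq_append))      # 1
```

With append(), rotating the deque would only ever move that one list around, so the file names could not be cycled individually.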

The function that takes the data list (strictly speaking, a deque) and the requested number of samples, and returns the data, is as follows. It uses a list slice and deque.rotate().

def feed_fn_ver0(dq, num):
    feed = list(dq)[-num:]
    dq.rotate(num)
    
    return feed

Taking out 3 samples 5 times with this function gives the following.

0: ['img_7.jpg', 'img_8.jpg', 'img_9.jpg']
1: ['img_4.jpg', 'img_5.jpg', 'img_6.jpg']
2: ['img_1.jpg', 'img_2.jpg', 'img_3.jpg']
3: ['img_8.jpg', 'img_9.jpg', 'img_0.jpg']
4: ['img_5.jpg', 'img_6.jpg', 'img_7.jpg']

We were able to retrieve 3 samples at a time from the end of the deque. This is no problem for a machine learning process that does not care about the order, but taking data "from the end" feels a little awkward, so the next version takes data "from the beginning" and also checks the requested length.

def feed_fn_ver1(dq, num):
    '''
      dq  : data source (deque)
      num : request size (int)
    '''
    # check length
    assert num <= len(dq)
   
    feed = list(dq)[:num]
    dq.rotate(-num)

    return feed

my_list = mk_list()
for i in range(5):
    print(' Feed [', i, ']: ', feed_fn_ver1(my_list, 3))
    
>>>
 Feed [ 0 ]:  ['img_0.jpg', 'img_1.jpg', 'img_2.jpg']
 Feed [ 1 ]:  ['img_3.jpg', 'img_4.jpg', 'img_5.jpg']
 Feed [ 2 ]:  ['img_6.jpg', 'img_7.jpg', 'img_8.jpg']
 Feed [ 3 ]:  ['img_9.jpg', 'img_0.jpg', 'img_1.jpg']
 Feed [ 4 ]:  ['img_2.jpg', 'img_3.jpg', 'img_4.jpg']
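One thing to be aware of is that list(dq)[:num] copies the whole deque on every call, which may matter when the file list is very long. A possible variation, sketched below under that assumption, uses itertools.islice to copy only the first num elements. The name feed_fn_islice and the ValueError are choices made here for illustration and are not part of the original code.

```python
from collections import deque
from itertools import islice

def feed_fn_islice(dq, num):
    # hypothetical variant of feed_fn_ver1 (name chosen for this sketch)
    if num > len(dq):
        raise ValueError('request size exceeds the number of samples')
    feed = list(islice(dq, num))   # copy only the first num elements
    dq.rotate(-num)                # move the used samples to the end
    return feed

# stand-in for the file-name deque built by mk_list()
dq = deque('img_%d.jpg' % i for i in range(10))
for i in range(4):
    print(i, feed_fn_islice(dq, 3))
```

The samples come out in the same order as with feed_fn_ver1(); only the way the first num elements are copied differs.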

It worked as intended. A version that shuffles the data randomly is as follows.

def mk_list_shuffle():
    fname_list = glob.glob('*.jpg')
    np_list_fn = np.array(fname_list)
    np.random.shuffle(np_list_fn)            # shuffle in place
    deq_fname = deque()
    deq_fname.extend(np_list_fn.tolist())    # back to plain Python strings
    
    return deq_fname

my_list = mk_list_shuffle()
for i in range(5):
    print(' Feed [', i, ']: ', feed_fn_ver1(my_list, 3))

>>>
 Feed [ 0 ]:  ['img_9.jpg', 'img_7.jpg', 'img_6.jpg']
 Feed [ 1 ]:  ['img_1.jpg', 'img_8.jpg', 'img_3.jpg']
 Feed [ 2 ]:  ['img_4.jpg', 'img_0.jpg', 'img_2.jpg']
 Feed [ 3 ]:  ['img_5.jpg', 'img_9.jpg', 'img_7.jpg']
 Feed [ 4 ]:  ['img_6.jpg', 'img_1.jpg', 'img_8.jpg']

It is a little hard to see at a glance, but a close look shows that the data is supplied cyclically: Feed 3 wraps around to img_9.jpg and img_7.jpg, the first samples of Feed 0. Using this function (feed_fn_ver1()), the machine learning training process should be simple to write.
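As a rough sketch of how this could look in a training loop, the example below goes back to the numbers from the beginning (num = 100, batch_size = 30). It uses integers as a stand-in for file names, and the variables steps and batches are names introduced here; the actual image loading and model update are left out.

```python
import math
from collections import deque

def feed_fn_ver1(dq, num):
    # same logic as feed_fn_ver1 above, repeated so the sketch runs on its own
    assert num <= len(dq)
    feed = list(dq)[:num]
    dq.rotate(-num)
    return feed

num = 100
batch_size = 30
data = deque(range(num))    # stand-in for the deque of file names

# number of mini-batches needed to see every sample at least once
steps = int(math.ceil(num / float(batch_size)))
batches = [feed_fn_ver1(data, batch_size) for _ in range(steps)]

for i, batch in enumerate(batches):
    # a real training step (load images, update the model) would go here
    print(i, len(batch), batch[0], batch[-1])
```

Every mini-batch has the full size of 30. The fourth one contains samples 90 to 99 followed by samples 0 to 19, which is the reuse described in option 3.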

(The code above was confirmed to work with Python 2.7.11 and Python 3.5.1.)

References (web site)

- Introduction Python3 http://www.oreilly.co.jp/books/9784873117386/
