[PYTHON] About data preprocessing of systems that use machine learning

First edition: 2020/3/3
Authors: Soichi Takashige, Masahiro Ito, Hitachi, Ltd.

Introduction

In this series of posts, we introduce design know-how for data preprocessing, together with performance verification results for numerical data preprocessing, for those designing a system that incorporates a machine learning model.

In this first installment, we give an overview of data preprocessing in machine learning systems and of its design.

**Post list:**

  1. About data preprocessing of systems that use machine learning (this post)
  2. Performance verification of data preprocessing for machine learning (numerical data) (Part 1)
  3. Performance verification of data preprocessing for machine learning (numerical data) (Part 2)

Outline and flow of AI projects

Data analysis using AI technologies such as machine learning is attracting attention, and the number of projects that use AI is increasing. In an AI project, customer data is analyzed and machine learning models are built to gain insights and automate predictions. In many cases, an AI project proceeds in two stages: 1) a PoC (Proof of Concept), led by data scientists who are experts in data analysis, confirms the usefulness of the data analysis and of the machine learning models; and 2) system engineers (hereinafter, SEs) turn the confirmed results into a production system. In addition to the report submitted to the customer, the PoC deliverables produced by the data scientists include source code written in a language such as Python, which the SEs then use as the basis for designing and building the system.

Challenges in systematization

One of the challenges of building a production system from PoC deliverables is the increase in data volume in the production environment. Because a PoC is only a verification, the amount of data entrusted by the customer is often small, and even when a large amount of data is available, it is often sampled for quick verification. In short, a PoC typically uses an amount of data small enough to be processed on a single desktop machine. On the other hand, because machine learning systems learn from data, prediction accuracy generally improves as the amount of data grows. Therefore, when building the production system, a large amount of data is often used to improve prediction accuracy, and the system is required to deliver high data processing performance.

Furthermore, while data scientists are responsible for the machine learning models, data preprocessing is generally designed mainly by engineers such as SEs. Data preprocessing is closely tied to the machine learning model, but it also requires SE knowledge such as infrastructure design, sizing, and recovery handling in the event of a failure. At that point, it is not efficient for the SE side to redesign and reimplement the preprocessing from scratch at high cost in time and effort. Instead, an approach that builds the system by reusing the prototypes that the data scientists developed in Python or similar is effective. However, little public information is currently available on this kind of design know-how. In this series, we introduce the design procedure and key points for data preprocessing, based on knowledge obtained from performance verification, for SEs who have completed the PoC stage and are designing a system that incorporates a machine learning model.

Overview of data preprocessing in machine learning systems

Data preprocessing in machine learning systems

In a system that uses machine learning, data preprocessing takes place mainly in three phases: learning, inference, and retraining. Figure 1 gives an overview.
Figure 1 Overview of a system that utilizes machine learning

  1. Data preprocessing during learning

Data sets stored in various formats are converted into a data structure suitable for learning a given model, and the data is normalized and aggregated to improve accuracy. In preprocessing at learning time, all of the data is often processed at once, so the amount of processing tends to be far larger than at inference time (a small sketch contrasting learning-time and inference-time preprocessing follows this list).

  2. Data preprocessing during inference

At inference time, the model is used to classify data collected in the field and to predict trends. Inference may be subject to latency requirements, for example on the order of seconds, in which case the preprocessing is subject to those latency requirements as well. Preprocessing at inference time converts the data so that it has the same features as at learning time, but it is applied only to the data to be inferred, so the processing itself may be simpler. On the other hand, because it must run every time the model is used, it is called frequently and continuously in the production system.

  3. Data preprocessing during retraining

If data usable for learning accumulates from the actual data at the operation site, it may be used to update the model. In that case, the same processing as the learning-time data preprocessing is performed.
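As a minimal illustration of the relationship between learning-time preprocessing (item 1) and inference-time preprocessing (item 2), the sketch below fits a transformation on the whole training data, saves it, and reapplies it to a single record at inference time. It assumes scikit-learn and joblib are available; the file names and column names are hypothetical.

```python
# Minimal sketch: fit the preprocessing on the full training data, then reuse
# the same fitted transformation for single records at inference time.
# Assumes scikit-learn and joblib; file and column names are hypothetical.
import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Learning time: preprocess the whole data set at once ---
train_df = pd.read_csv("train.csv")                      # large batch of records
scaler = StandardScaler().fit(train_df[["temperature", "pressure"]])
X_train = scaler.transform(train_df[["temperature", "pressure"]])
joblib.dump(scaler, "scaler.joblib")                     # keep the fitted parameters

# --- Inference time: apply the same transformation to one record ---
scaler = joblib.load("scaler.joblib")
record = np.array([[21.5, 101.3]])                       # one observation from the field
X = scaler.transform(record)                             # same features as at learning time
```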

Issues and solutions in data preprocessing design

Because a PoC by data scientists is usually performed with a small amount of data, the preprocessing during the PoC is often implemented in Python. If that Python preprocessing is carried over as-is when building the production system, the issues shown in Table 1 below arise. Table 1 also shows a solution for each issue.

Table 1 Issues and solutions in systematizing data preprocessing

| # | Issue | Solution |
|---|-------|----------|
| 1 | Because of the huge amount of data to be processed, preprocessing takes a very long time. | Use a big data processing platform: design and implement the preprocessing written in Python etc. so that, when the data volume is large, it can be executed in parallel and distributed on a platform such as Spark. |
| 2 | If the big data processing platform of (1) is used, the preprocessing must be reimplemented, which takes man-hours. | Implement preprocessing with systematization in mind from the PoC stage: adopt, already at the PoC stage, a Python implementation style that can be moved to Spark without major changes. |
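The second row of Table 1 suggests writing the PoC preprocessing so that it can later move to Spark with only minor changes. The sketch below shows one possible way to do this, assuming Spark 3.2 or later where the pandas API on Spark is available; the column names and file paths are hypothetical.

```python
# Minimal sketch: write the preprocessing against the pandas-style API so the
# same function runs on pandas at PoC time and on the pandas API on Spark in
# production (Spark 3.2+). Column names and paths are hypothetical.
import pandas as pd

def preprocess(df):
    """Normalize a numeric column and aggregate it per device."""
    df = df.dropna(subset=["value"])                     # drop records with missing values
    df["value_norm"] = (df["value"] - df["value"].mean()) / df["value"].std()
    return df.groupby("device_id")["value_norm"].mean().reset_index()

# PoC: small sampled data on a single machine
result = preprocess(pd.read_csv("sample.csv"))

# Production (sketch): the same function on the pandas API on Spark
# import pyspark.pandas as ps
# result = preprocess(ps.read_csv("hdfs:///data/full/*.csv"))
```

Keeping the logic behind a single pandas-style interface limits the rework when moving to Spark, although steps that rely on pandas-only behavior still need to be reviewed individually.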

Design process for a system that utilizes machine learning

A system that utilizes machine learning is designed following the process shown in Figure 2. As mentioned at the beginning of this post, it is common to first carry out a PoC to confirm the usefulness of machine learning, and then systematize the logic whose usefulness was confirmed there.

Figure 2 Outline of the design process for a system that utilizes machine learning

As shown in Figure 1, a system that utilizes machine learning consists of two major parts: the learning system and the inference system. From here on, we deal with the learning system; compared with the learning system, the inference system generally tends to handle far less data per run.

Preprocessing design during learning

Table 2 lists the design items for designing and implementing the data preprocessing of the learning system. Only the items that are characteristic of machine learning are shown here.

Table 2 List of design items for the learning system

| # | Design item | Details |
|---|-------------|---------|
| 1 | Examination of system requirements | Throughput requirements; total run-time requirements; availability / recovery requirements; estimate of the total number of preprocessing types |
| 2 | Data design | Data placement design; intermediate data placement design |
| 3 | Resource estimation on actual machines | Analysis of the PoC code on a small data set; number of records; data size per record; estimate of data sparseness; estimate of total data size; estimate of the training data size and of the intermediate data to be saved |
| 4 | Implementation | Selection of the execution platform (Python, Spark, etc.); estimate of the number of nodes; estimate of the amount of memory per node; determination of the data processing method; implementation that keeps the processing common between Python and Spark; Spark support for the Python code |
| 5 | Availability design | Design of switchback and re-execution when processing fails |
| 6 | Operational design | Design of preprocessing updates accompanying model updates; design of history management and deletion of processed data |

Resource estimation

For the learning system, the most important point is whether model development with the target data can be completed within the period required by the system requirements. In the PoC phase, only part of the target data is often handled (a limited period, limited devices, limited document types, etc.), whereas the subsequent systematization handles the entire period and all types of data, which can result in a huge amount of data. For preprocessing, it is therefore important in the systematization design to estimate the resources (number of CPUs, amount of memory) needed so that even large data can be processed within an acceptable processing time.

To determine the resources required for preprocessing, first measure the input/output data size and the processing time of each step in the preprocessing using a small data set. Also identify the steps (those with many repetitions) for which the logic optimizations described in Part 2 and later can be expected to be effective.
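As a minimal sketch of such a measurement, the code below records the input size, output size, and elapsed time of each preprocessing step on a small data set; the step functions, column name, and file name are hypothetical.

```python
# Minimal sketch: measure input size, output size, and elapsed time of each
# preprocessing step on a small data set. Steps and names are hypothetical.
import time
import pandas as pd

def measure(name, step, df):
    """Run one preprocessing step and report its data sizes and elapsed time."""
    in_mb = df.memory_usage(deep=True).sum() / 1e6
    start = time.perf_counter()
    out = step(df)
    elapsed = time.perf_counter() - start
    out_mb = out.memory_usage(deep=True).sum() / 1e6
    print(f"{name}: in={in_mb:.1f} MB, out={out_mb:.1f} MB, time={elapsed:.2f} s")
    return out

df = pd.read_csv("sample.csv")                           # small sampled data set
df = measure("drop_missing", lambda d: d.dropna(), df)
df = measure("normalize",
             lambda d: d.assign(value_norm=(d["value"] - d["value"].mean())
                                           / d["value"].std()),
             df)
```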

For the number of CPUs, the processing time on the production system is estimated from the production input data size, based on the input data size and processing time measured with the small data set (here, processing time is assumed to be proportional to data size). Dividing this estimated time by the processing time specified in the system requirements gives a rough estimate of the number of CPUs needed.

For the amount of memory, the data size of each step on the production system is estimated from the input data size of each step measured with the small data set, and the total of these is taken as the estimate.
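The sketch below works through both estimates under the proportionality assumption above; all numbers are hypothetical.

```python
# Minimal sketch of the CPU and memory estimates, assuming processing time and
# data size scale linearly with the input size. All numbers are hypothetical.
import math

small_input_gb = 2.0      # input size of the small data set
small_time_h = 0.5        # measured preprocessing time on the small data set (one CPU)
small_total_gb = 6.0      # total data size over all steps on the small data set

prod_input_gb = 500.0     # expected input size in production
required_time_h = 8.0     # processing time allowed by the system requirements

scale = prod_input_gb / small_input_gb
est_time_one_cpu_h = small_time_h * scale          # production time on one CPU
est_cpus = math.ceil(est_time_one_cpu_h / required_time_h)
est_memory_gb = small_total_gb * scale             # total data size in production

print(f"estimated CPUs: {est_cpus}, estimated memory: {est_memory_gb:.0f} GB")
```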

Selection of data processing infrastructure

As the platform for executing data preprocessing in the production preprocessing system, it is necessary to decide whether to build an environment that runs the Python PoC code as-is in Python, or to run the preprocessing on a distributed processing platform such as Spark. Basically, if the resource estimate suggests that memory will be insufficient, the preprocessing should run on Spark; if the amount of memory is not a problem, the preprocessing can remain in Python as it is.
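The selection rule above can be written down as a small helper; the safety factor and all sizes are hypothetical.

```python
# Minimal sketch of the platform selection rule: run plain Python if the
# estimated data size fits in one node's memory, otherwise plan for Spark.
# The safety factor and all sizes are hypothetical.
def choose_platform(estimated_memory_gb, node_memory_gb, safety_factor=0.7):
    """Return the suggested execution platform for the preprocessing."""
    if estimated_memory_gb <= node_memory_gb * safety_factor:
        return "python"   # the PoC code can run as-is on a single machine
    return "spark"        # adapt the preprocessing for distributed execution

print(choose_platform(estimated_memory_gb=1500, node_memory_gb=64))  # -> "spark"
```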

Conclusion

In this post, we gave an overview of data preprocessing in a system that uses machine learning and of its design. Next time, we will introduce know-how for improving the performance of numerical data preprocessing in Python, together with performance verification results on actual machines.

Next post: Performance verification of data preprocessing for machine learning (numerical data) (Part 1)
