[PYTHON] ML Pipeline: Highlighting the Challenges of Manual Feature Extraction

In Part 1 of this four-part series, we'll explore the ML pipeline and highlight the challenges of manual feature extraction.

Alibaba Cloud Community Blog Author [Ahmed F. Gad](https://www.linkedin.com/in/ahmedfgad?spm=a2c65.11461447.0.0.5bd3e739Qn9U9o)

Throughout the four tutorials in this series, we will explore image classification using an artificial neural network (ANN) trained on features extracted by transferring the learning of a pre-trained deep learning (DL) convolutional neural network (CNN) model in Keras.

The series begins by exploring the machine learning (ML) pipeline and emphasizes that manual feature engineering is a difficult task, especially when there is a large amount of data, and that automatic feature extraction using transfer learning is the preferred alternative. After that, we introduce transfer learning and examine its advantages and the conditions for its use. Later in the series, we will use Keras running in a Jupyter notebook to apply transfer learning from a pre-trained model (MobileNet), trained on the ImageNet dataset, to another dataset, the Fruits360 dataset. Features are then extracted from the transferred model, the extracted features are analyzed to remove bad ones, and finally an ANN is built and trained on those features.

This tutorial is the first part of the series: it explores the ML pipeline to highlight the challenges of manual feature extraction, and then introduces transfer learning and explains why and when we can use it.

The points covered in this tutorial are:

- Exploring the machine learning pipeline
- Manual feature engineering
- Automating feature extraction using deep learning
- What is transfer learning?
- Why use transfer learning?
- Use cases where transfer learning is beneficial
- Conditions for using transfer learning

Let's get started.

Explore the machine learning pipeline

To understand the benefits of transfer learning from pre-trained DL models, we first discuss the ML pipeline. This gives you an idea of the core benefits of transfer learning. The following figure shows the pipeline for building a machine learning model. Other steps, such as feature reduction, might be added to the pipeline, but the steps below are sufficient to build a model. Let's take a brief look at each step in the pipeline and then focus on the feature engineering step.

[Figure: steps of the machine learning pipeline]

Problem definition means understanding the problem you are trying to solve in order to find the most suitable ML technique. This starts with defining the scope of the problem: is it a supervised problem (classification or regression) or an unsupervised problem (clustering)? After defining the scope of the problem, the next decision is which ML algorithm to use. For a supervised problem, for example, you decide whether the algorithm should be linear or non-linear, parametric or non-parametric, and so on.

Defining the problem helps in preparing the data, which is the next step in the ML pipeline. Machines learn from examples, and each example has an input and an output. If the problem is a classification problem, where each sample belongs to one of a set of predefined categories, the output is a label. If it is a regression problem, where the output is a continuous value, the output is no longer a label but a number. Defining the problem in this way allows us to prepare the data in the proper form.

Manual feature engineering

Once the data is ready, the next step is feature engineering. This is the most important step in building a traditional machine learning model. First, why do we do feature engineering at all? Feature engineering means transforming the data from its current form into another form that helps solve the problem at hand. How is data transformed from one form to another? By using feature descriptors. In computer vision, there are various feature descriptors for transforming an image from one form to another, falling into categories such as color, edge, texture, and keypoint descriptors.

Each of these categories contains different types of descriptors. For example, texture descriptors include the gray-level co-occurrence matrix (GLCM) and local binary patterns (LBP), while keypoint descriptors include the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and Harris. Another question comes up here: which type of descriptor is best to use for a given problem?
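As a concrete, hedged illustration, here is a minimal sketch of extracting two hand-picked descriptors, a color histogram and an LBP texture histogram, and concatenating them into a single feature vector. It assumes OpenCV and scikit-image are available; the image path is only a placeholder.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

image = cv2.imread("example.jpg")                 # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Color descriptor: a 3D color histogram over the BGR channels.
color_hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8],
                          [0, 256, 0, 256, 0, 256]).flatten()

# Texture descriptor: local binary patterns (LBP) summarized as a histogram.
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)

# The hand-crafted feature vector that would be fed to a traditional ML algorithm.
features = np.concatenate([color_hist, lbp_hist])
print(features.shape)
```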

The data scientist manually decides which feature descriptor to use for a given problem, starting with a few descriptors suggested by their experience with similar problems. Based on the selected descriptor, features are extracted from the images, and then two more steps follow: training the ML algorithm and testing the trained model. Note that the model is the result of training the algorithm.

Because the choice of descriptor can be wrong and the test error of the trained model can be large, the data scientist must keep changing descriptors until the error is acceptably low. Every time a new descriptor is selected, the ML algorithm must be trained and tested again.
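This trial-and-error loop can be sketched minimally as follows. The images and labels are random placeholders, and the two descriptors are toy stand-ins (a channel mean and a crude edge-energy measure) rather than real GLCM- or SIFT-style descriptors; scikit-learn's SVC plays the role of the traditional ML algorithm.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
images = rng.random((60, 32, 32, 3))              # placeholder images
labels = rng.integers(0, 2, size=60)              # placeholder class labels

# Two toy "descriptors" standing in for real ones such as a color histogram or LBP.
def color_mean(img):
    return img.mean(axis=(0, 1))                  # mean value of each color channel

def edge_energy(img):
    return np.abs(np.diff(img, axis=0)).mean(axis=(0, 1))   # crude edge strength

for name, extract in {"color_mean": color_mean, "edge_energy": edge_energy}.items():
    X = np.array([extract(img) for img in images])
    X_train, X_test, y_train, y_test = train_test_split(X, labels,
                                                        test_size=0.3, random_state=0)
    model = SVC().fit(X_train, y_train)           # train the ML algorithm on the features
    error = 1.0 - accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test error = {error:.2f}")    # compare errors and pick a descriptor
```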

In addition to the error, there may be other factors to consider when choosing a descriptor, such as computational complexity. Clearly, manually selecting the best descriptors can be tedious, especially for complex problems that involve analyzing thousands or millions of images.

Example of selecting features

To apply the above discussion, let's select the best type of features for classifying the three images shown in the following figure, assuming that each image corresponds to a different class. Which category of features (color, edge, texture, or keypoint) is most suitable? The colors of these three images are clearly different, so a color descriptor such as a color histogram can serve the purpose exactly. Once you have built an accurate model, you can move on to the final step in the pipeline, model deployment.

[Figure: three images, one per class, distinguishable by color]

Given that each image corresponds to a different class, what happens if we add more images to the dataset, as shown below? Some of the different images now clearly have similar colors, so the color histogram alone may no longer serve the purpose, and we need to look for other types of descriptors.

Suppose a descriptor X is selected and works well at capturing the differences between the images below. It is always possible that new images are added that are indistinguishable using descriptor X, so instead of descriptor X we have to find another descriptor that can capture the differences. This process repeats as the number of images grows.

[Figure: an extended dataset in which the chosen descriptor can no longer separate the classes]

The discussion above emphasizes that manual feature engineering is a hassle. If manual feature engineering for a given problem is a hassle, what is the alternative? Deep learning, DL for short.

Automation of feature extraction using deep learning

DL automates traditional machine learning: the machine itself determines the best types of features to use. The following figure compares the traditional ML and DL pipelines. Rather than doing feature engineering as in the ML pipeline, in the DL pipeline humans only supervise the construction of the DL architecture. Training then finds the optimal features automatically so as to reduce the error as much as possible. The DL algorithm used to recognize multidimensional data such as images is the convolutional neural network (CNN).

[Figure: comparison of the traditional ML pipeline and the DL pipeline]
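To make the contrast concrete, below is a minimal sketch of a small CNN in Keras whose convolutional layers learn the features themselves rather than relying on hand-picked descriptors. The input size and number of classes are illustrative assumptions, not values from this series.

```python
from tensorflow.keras import layers, models

num_classes = 10                                  # illustrative assumption
model = models.Sequential([
    layers.Input(shape=(100, 100, 3)),            # illustrative image size
    layers.Conv2D(32, 3, activation="relu"),      # learned filters replace manual descriptors
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()                                   # the features are learned during training
```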

DL makes it much easier to find useful features, but there are some things to keep in mind. For a CNN to adapt itself automatically and find the best features, thousands of images must be available. MobileNet, for example, is a CNN model trained on ImageNet, the largest image recognition dataset on the planet with over one million samples. Abundant data is the driving force behind MobileNet; it could not have been created if such a large dataset were not available. An important question arises here: what should you do if you do not have a large dataset to build a DL model from scratch, but still want to extract features automatically and save the time spent trying different feature descriptors for a traditional ML model? The answer is transfer learning.

You do not need to build a DL model from scratch to use DL. You can take advantage of what a pre-trained DL model has already learned and port it to your own problem. The next section describes transfer learning.
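For example, here is a minimal sketch of loading MobileNet with its ImageNet weights in Keras and using it, with the classification head removed, as a feature extractor. The input batch is random and purely illustrative.

```python
import numpy as np
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input

# Load MobileNet with its ImageNet weights and without its classification head,
# so the network outputs a feature vector per image instead of class scores.
base = MobileNet(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))

images = np.random.rand(4, 224, 224, 3) * 255.0   # placeholder batch of images
features = base.predict(preprocess_input(images))
print(features.shape)                             # one feature vector per input image
```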

What is transfer learning?

Transfer learning is more about adaptation than creation. The model is not created from scratch; a pre-trained model is simply adapted to a new problem. If you have a small dataset that is not enough to build a DL model from scratch, transfer learning is an option for extracting features automatically. The following figure emphasizes this.

Before transfer learning, the DL model is trained on a large dataset with thousands to millions of samples. The learning of that DL model is then transferred using transfer learning, allowing it to work with another, small dataset of only hundreds or thousands of images.

[Figure: a model trained on a large dataset is transferred to a small dataset]
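A minimal sketch of this adaptation, again assuming the Keras MobileNet base: the pre-trained layers are frozen and only a small new classifier head is trained on the small dataset. The number of classes and the commented-out training arrays are illustrative assumptions.

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras import layers, models

base = MobileNet(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
base.trainable = False                            # keep the learning transferred from ImageNet

num_classes = 5                                   # illustrative assumption
model = models.Sequential([
    base,
    layers.Dense(128, activation="relu"),         # new layers trained on the small dataset
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_small, y_small, epochs=5)           # hypothetical small training set
```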

Many people ask whether they can use deep learning on a dataset with a small sample size. There is no clear-cut answer, but what can be said is that the accuracy of the model produced by transfer learning improves as the number of samples in the new dataset increases. The new dataset does not have to be as large as the original dataset used to train the DL model, but the more samples, the better. As shown in the following figure, the larger the new dataset, the more the model is customized to work with it, because more samples allow the pre-trained parameters to be adapted more thoroughly to the new data. As a result, a model obtained by transfer learning on more samples makes more accurate predictions than one adapted with fewer samples.

[Figure: accuracy of the transferred model improves with the size of the new dataset]

To understand transfer learning better, the next section explains why you should use it.

Why use transfer learning?

There are several reasons to do transfer learning. Here are some of the important reasons.

  1. There is a lack of training and test data to build a model from scratch.
  2. There is no need to label data to grow the dataset.
  3. The distribution of the data is imbalanced.
  4. Even if the training data is sufficient, training a DL model from scratch usually requires high processing power and takes time.
  5. Even if the training data is sufficient, the test data may not be similar to the training data, or the test data may include new cases that the training data did not cover; the model would have to be retrained with new samples to handle them.
  6. To build a model from scratch, you need to study the problem and have a deep understanding of how things work.

Let's discuss these points.

1. Lack of training and test data to build a model from scratch

When building a predictive model, the ML engineer's first task is to collect as much data as possible in order to build an accurate model that can handle different cases. Parametric machine learning algorithms have a large number of parameters to learn from the data, and for some tasks there is not enough data for the algorithm to learn these parameters correctly.

Transfer learning does not require a lot of data, because the algorithm is not trained from scratch to build the model. Instead, a pre-trained model, in which these parameters have already been learned, is used, and only a small amount of data is needed to adapt it to the problem at hand.

2. No need to label the data to grow the dataset

If the dataset used to train and test the machine learning algorithm is not large enough for the model to reach a solid learning state, machine learning engineers tend to grow the dataset in different ways. The most preferred way is to collect more realistic samples and label them for use in training the algorithm. However, manual labeling is not easy, and automatic labeling may not be accurate enough.

For some types of problems, labeling is not the issue; an instance can be labeled as soon as it becomes available. For other problems, the number of instances is limited and creating more is not easy. For medical images, for example, patient permission is required to use the data in experiments, and not all patients agree. When there is no way to create more instances, techniques such as image data augmentation can help, but only to a limited degree, because they merely transform (rotate, shift, and so on) the same instances to produce more images.
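For illustration, here is a minimal sketch of image data augmentation using Keras's ImageDataGenerator; the image and label arrays are random placeholders.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

X = np.random.rand(8, 100, 100, 3)                # placeholder images
y = np.arange(8)                                  # placeholder labels

# Each generated batch contains transformed (rotated, shifted, flipped) copies
# of the same underlying instances, not genuinely new samples.
augmenter = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
batch_images, batch_labels = next(augmenter.flow(X, y, batch_size=8))
print(batch_images.shape)
```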

Transfer learning addresses this issue: since the model is not built from scratch, it does not require a lot of data. Only a small amount of data is needed to fine-tune the pre-trained model. It is preferable to use more data for fine-tuning, but if you cannot, that is still acceptable.

3. Unbalanced distribution of data

The previous point considered a dataset that is balanced but whose classes all have small sample sizes. Balanced means that the classes in the dataset have approximately equal numbers of samples, so that no class has a significantly larger sample size than the others.

In other problems, one class may have many more samples than another. In that case, the dataset is said to have an imbalanced class distribution. As a result, the machine learning model becomes strongly biased towards that class and treats it as more important than the others: the probability of classifying an input sample with that class's label is higher than for the other classes. The class with the high proportion of samples is called the majority class, and the others are called minority classes. Engineers have to deal with this problem in a variety of ways.

If the number of samples in the minority class is enough to build a machine learning model, then, instead of using all the samples in the majority class, you can select a comparable proportion of them and obtain a well-balanced dataset.

If the minority class has too few samples to build a machine learning model, new minority-class samples must be added. As mentioned earlier, the most preferred method is to collect more realistic data for the minority class. If that is not possible, some synthetic techniques are available to create new samples; one of them is SMOTE (Synthetic Minority Over-sampling Technique). The problem is that samples generated this way are not realistic, and the more unrealistic samples are used, the worse the learning process becomes.
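Here is a minimal sketch using the imbalanced-learn package that illustrates both options above: undersampling the majority class and generating synthetic minority samples with SMOTE. The feature matrix is random and purely illustrative.

```python
import numpy as np
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

X = np.random.rand(110, 20)                       # placeholder feature matrix
y = np.array([0] * 100 + [1] * 10)                # class 0 is the majority, class 1 the minority

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop majority samples
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)               # synthesize minority samples
print(Counter(y_under), Counter(y_smote))         # both end up balanced, in different ways
```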

Transfer learning overcomes this problem for the reasons mentioned above: you can fine-tune the model with a few realistic samples. More samples are of course preferred, but if you do not have many, transfer learning is a better option than learning from scratch.

4. Training a DL model from scratch requires high processing power and takes a lot of time

The points so far rejected building a model from scratch, in favor of transfer learning, only because the amount of data was insufficient. Does having plenty of data mean you should build a model from scratch and not use transfer learning? Definitely not. Transfer learning is not chosen only when the amount of data is small: building a model from scratch also requires a powerful machine and a large amount of RAM, and not everyone has access to machines with such specifications. Even when cloud computing is available, it can be costly for some people. Therefore, transfer learning may be used even when there is a sufficient amount of data; with enough data, you can fine-tune the model to adapt it to the problem you want to solve.

The pre-trained model is assumed to be generic, and the engineer adapts it to the problem at hand, moving from the general case to a more specific case that suits the purpose. Fine-tuning may not require the large amount of data used when building a model from scratch, but more data still helps to adapt the model to the problem.

5. New test samples are not included in the training data

In most problems solved using machine learning, the training and test data are similar and come from the same distribution. A model built from such data is not difficult to test on samples that resemble the training data. The problem arises when some new samples are not similar to the training data and follow a slightly different distribution.

Machine learning engineers address this issue by building new models that work with these new samples, but this is not practical: the behavior of a trained model cannot be changed for every group of samples whose characteristics differ from those used in training. The model may already be in production and cannot be modified each time new samples become available.

With transfer learning, the pre-trained model has already seen thousands to millions of samples covering many of the cases that may appear in the test data, so it is less likely to encounter unfamiliar samples in the future.

6. To build a model from scratch, you need to investigate the problem and have a deep understanding of how things work.

When a researcher builds a deep convolutional neural network (DCNN), the first step is to gain a solid understanding of how an artificial neural network (ANN) works. Since CNNs are an extension of ANNs, the researcher must also understand how CNNs work and the different types of layers they use. The researcher then has to design a CNN architecture for the problem by stacking different layers, which is a daunting task that requires a lot of time and effort to arrive at the best architecture.

With transfer learning, researchers do not need to know everything; they only need to worry about the number of parameters they want to tweak.
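As an illustration, the following sketch freezes a pre-trained MobileNet base in Keras and lets model.summary() report how many parameters would actually be tuned versus how many stay frozen; the five-class head is an arbitrary assumption.

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras import layers, models

base = MobileNet(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
base.trainable = False                            # the pre-trained parameters stay untouched

model = models.Sequential([base, layers.Dense(5, activation="softmax")])
model.summary()                                   # reports trainable vs. non-trainable parameters
# Only the small Dense head is tuned; the millions of MobileNet parameters are reused as-is.
```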

Use cases where transfer learning is beneficial

Suppose you have a dataset with two classes, cat and dog, and you want to create a CNN for that classification task. It may take time and effort to design an architecture that achieves high classification accuracy. If you are later given the task of classifying a two-class dataset of horses and donkeys, repeating the same work done for the cat-and-dog classification is tiring. This is exactly where transfer learning is effective: what the CNN learned on the cat-and-dog dataset can be transferred to the other task, horse-and-donkey classification, which saves a lot of time compared to starting over.

Conditions for using transfer learning

Using transfer learning correctly can achieve great results, but transfer learning can also be misused. It is therefore important to highlight the main conditions for deciding whether to use transfer learning from a pre-trained DL model. These conditions are:

  1. Data type consistency
  2. Problem domain similarity

Let's consider these two conditions.

1. Data type consistency

Before learning is transferred from one problem to another, the two problems must be consistent with respect to the type of data used. Data type means images, audio, text, and so on.

If images were used to build the DL model, images must also be used when transferring the model's learning to the new problem. It is not correct to transfer what was learned from images to a new task that uses audio data: the features learned from images are different from those that should be learned from audio signals, and vice versa.

2. Problem domain similarity

Data type consistency is an essential condition that must hold before transfer learning is applied, but other factors also help maximize its benefits. It is preferable that the two problem domains are similar: we are concerned not only with the type of data but also with how similar the two problems are. One problem may be the classification of cats and dogs; the learning gained from it clearly applies to another problem that classifies two other animals, such as horses and donkeys.

It is also possible to transfer this learning to a problem in a different domain, such as classifying two types of tumors, but its usefulness is limited. Even though image data is used in the tumor classification problem, transfer learning is applicable yet limited in ability because the problem domains differ. In a CNN, the early layers learn generic features that can be applied to any type of problem, while the deeper layers focus on the specific task being solved. Since the cat-and-dog dataset is similar in domain to the horse-and-donkey dataset, many of the features learned in one model can be applied to the other problem, and the similarity extends to the deeper levels of the CNN. When the learning is transferred to a problem in a different domain, however, the reusable similarity reaches only the shallower levels of the CNN.
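This idea is commonly reflected in practice by freezing the earlier, generic layers of a pre-trained model and leaving the deeper, task-specific layers trainable, as in the following sketch (the cut-off index is an arbitrary assumption, not a recommendation).

```python
from tensorflow.keras.applications import MobileNet

base = MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

cutoff = 60                                       # hypothetical boundary between generic and specific layers
for layer in base.layers[:cutoff]:
    layer.trainable = False                       # keep the generic low-level features as they are
for layer in base.layers[cutoff:]:
    layer.trainable = True                        # let the deeper, task-specific layers adapt

trainable_count = sum(layer.trainable for layer in base.layers)
print(trainable_count, "of", len(base.layers), "layers left trainable")
```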

Conclusion

This tutorial first described the traditional machine learning pipeline and emphasized that manual feature extraction is difficult, especially for large and complex datasets. With deep learning, feature extraction is automated. However, building a deep learning model from scratch requires a large dataset. Transfer learning is an option for using deep learning for automatic feature extraction from small datasets.

In the next tutorial, Part 2, we'll go through the practical side of downloading, preparing, and analyzing the contents of the Fruits360 dataset. By the end of Part 2, NumPy arrays will have been created to hold the image data of the entire dataset along with its class labels. This data is later fed to MobileNet to extract features after transfer learning.
