[PYTHON] Importance of machine learning datasets

Here's my personal opinion about the importance of machine learning datasets.

("Then, how does it compare to other methods?" Add to the content written as an appendix to an independent article It was rewritten.)

Most machine learning examples you find and run on the web use well-known public datasets. The creators of the data published by research institutes, universities, and some companies in Japan and abroad are experts in machine learning in their fields, and their datasets are carefully prepared for training and evaluation. If you take such data and run a well-written sample program from the web as is, you can get reasonable results very easily, so you may be misled into thinking that machine learning itself is easy. However, it is worth knowing that collecting your own training data is what makes machine learning results practical. Face detection for matching a face against a passport photo at an airport is different from face detection in a digital camera or smartphone. Person detection for in-vehicle use is different from person detection in surveillance cameras. The differences in data characteristics are larger than you might imagine. Even with handwritten characters, pencils and ballpoint pens behave differently, producing faint strokes, for example. Once you know the quirks of the data your application actually has to handle, it is not uncommon to find that it behaves quite differently from existing datasets. So keep in mind that collecting a dataset for your own problem is not trivial, but it is worth the effort.

** Differences in data characteristics **

Differences between frontal-face and profile-face training

The book "Detailed OpenCV" also notes that training a detector for frontal faces differs from training one for profile faces. The difference is that, for a frontal face, almost nothing other than the face appears inside the training window, while for a profile face the background always contains something other than the face. It states that such differences in the characteristics of the data can lead to inadequate learning depending on how the training data is given. Although learning methods such as deep learning keep evolving, I think such problems can only be addressed by the person doing the training handling the data carefully.

An example: the difference between a voice actor's natural voice and their anime voice

The article linked below explains how results differ depending on the mismatch between training data and test data. The author trains and tests a classifier that identifies voice actors by voice, in two ways: (1) train on the voice actors' natural voices and predict the voice actor from their anime voices, and (2) train on the voice actors' anime voices and predict the voice actor from their natural voices. In both cases the recognition rate is reported to be insufficient. "I tried to classify the voices of voice actors"

** The training data must have a distribution suited to the application **

The training data must have a variance suited to the application, and it must be collected so that it has that variance. Concretely, check whether the between-class distribution and the within-class distribution match the purpose. Designing and carrying out the collection of such training data is central to machine learning; it is closer to measurement than to programming or devising algorithms. For image data, try displaying the eigenimages obtained by principal component analysis with matplotlib. You can then see what kind of variance the data has. Such plots help you understand what the data is and what the result of preprocessing looks like.
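As a concrete illustration, here is a minimal sketch of that check. It uses the Olivetti faces bundled with scikit-learn purely as a stand-in; substitute your own image array to see what variance your dataset actually has.

```python
# A minimal sketch: visualize the mean image and the leading eigenimages
# ("eigenfaces") obtained by PCA, together with their explained-variance
# ratios. The Olivetti faces are only a stand-in for your own data.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces()
X = faces.data                      # (n_samples, 64*64), already flattened
h, w = faces.images.shape[1:]

pca = PCA(n_components=16)          # keep the 16 leading components
pca.fit(X)                          # PCA subtracts the mean internally

fig, axes = plt.subplots(3, 6, figsize=(12, 6))
axes = axes.ravel()
axes[0].imshow(pca.mean_.reshape(h, w), cmap="gray")
axes[0].set_title("mean")
for i in range(16):
    axes[i + 1].imshow(pca.components_[i].reshape(h, w), cmap="gray")
    axes[i + 1].set_title(f"PC {i + 1}\n{pca.explained_variance_ratio_[i]:.2f}")
for ax in axes:
    ax.axis("off")
plt.tight_layout()
plt.show()
```

Looking at which components carry most of the variance tells you what your data actually varies in, which may or may not match what the application needs.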

** The training data must have a mean suited to the application (additional note) **

In addition, the mean of the training data should be suited to the purpose. The mean of the training data affects various parts of the algorithm. Even in face matching based on eigenfaces, the eigenfaces are obtained after subtracting the mean face. Therefore, if the data is biased in race or covers a very different age range, a face matching system built on it may not perform well. An age estimation system built from a database biased toward Europeans may not perform well for Asians. This seems to happen not only in machine learning: people also find it hard to distinguish, or to estimate the age of, people they do not usually see.
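One quick way to look for this kind of bias is to compare the mean image per subgroup with the overall mean. The sketch below assumes you have an `(n_samples, h, w)` image array and a per-sample group label (for example an age band or region); both names are placeholders for your own data.

```python
# A small sketch: compare the mean image of each subgroup with the overall
# mean to see whether the dataset average is dominated by one group.
# `images` and `groups` are placeholders for your own data.
import numpy as np
import matplotlib.pyplot as plt

def plot_group_means(images, groups):
    groups = np.asarray(groups)
    labels = np.unique(groups)
    fig, axes = plt.subplots(1, len(labels) + 1,
                             figsize=(3 * (len(labels) + 1), 3))
    axes[0].imshow(images.mean(axis=0), cmap="gray")
    axes[0].set_title(f"overall (n={len(images)})")
    for ax, label in zip(axes[1:], labels):
        subset = images[groups == label]
        ax.imshow(subset.mean(axis=0), cmap="gray")
        ax.set_title(f"{label} (n={len(subset)})")
    for ax in axes:
        ax.axis("off")
    plt.show()
```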

** Do not make training data too difficult **

I think a person needs to adjust how difficult the data included in the training set is allowed to be. In face detection, it is empirically known that adding images in which too much of the face is occluded degrades detection performance. Depending on the training algorithm, the distribution of the training data can cause a significant deterioration of the characteristics. Deciding what to include in the training data and what to exclude is essential work for obtaining practical learning results. (However, it is such empirical work that it is not well suited to those who want to write a dissertation about the learning algorithm itself.)

** Acquiring well-conditioned data and acquiring random data **

Data acquisition and evaluation require both well-conditioned data and randomly acquired data. Neither alone is enough. The skeleton of the algorithm cannot be built without well-conditioned data.

Take face detection and face matching as examples. You need a database that systematically covers face orientation and lighting conditions. Because such a database covers all the lighting conditions, you can evaluate how lighting affects face matching.

A database controlled in this way is likely to be needed for machine learning in other domains as well.

Well-conditioned data is very different from data found in the wild. An ID photo is taken against a plain background with nothing else in the image (http://www.keishicho.metro.tokyo.jp/menkyo/koshin/koshin/koshin02_2.html).

Many such datasets have been created since the early stages of research and development on face detection and face matching.

Next, we need to get random data.

When you try to detect faces for a particular purpose and want to know how well the system under development actually performs, you need to acquire data with as little manual effort as possible. Datasets of this kind have been increasing recently. The face matching database Labeled Faces in the Wild is one such dataset.

Relying only on over-conditioned data carries the risk of large blind spots.

** Dataset needs to be revised repeatedly **

When "then, how does it compare to other methods?" Is appropriate, when choosing a method to make a first hit, "easy to start, a method suitable for this problem, literature-based knowledge and I think that "the stage of choosing based on the experience of myself and others" and "the stage where the data are available and it makes sense to compare the methods". No matter how good the method is, it cannot be evaluated properly if the method is not understood by the person who uses it. Easy to get started, easy to understand for beginners, and easy to handle are also necessary for the first method. Please note the following points in order to reach the stage where the data is available and it makes sense to compare the methods.

--When the low recall rate is noticeable

A low recall rate for a category indicates that there is not enough training data for that category. Either the number of training samples in that category is small, or their distribution is too narrow. Supplementing them with real data or with processed (augmented) data is likely to improve the recall rate.

--When low precision is noticeable

This means that a lot of data from other categories is being classified into that category. Look at the confusion matrix to determine which categories are causing the confusion (see the sketch after this list). Increasing the training data for those categories will also improve the category whose precision is low.

--Algorithms where the boundary between positive and negative is important, and algorithms where the distribution itself is important

There are algorithms in which the boundary between positive and negative samples is what matters, and algorithms in which the distribution itself is what matters. This changes how training data should be collected. With some algorithms, adding only data that is easy to discriminate degrades recognition performance compared with before the addition; AdaBoost is one such algorithm. With SVMs, on the other hand, only the data that define the boundary (the support vectors) matter. On the other hand, adding data that is too difficult can also lower the recognition rate.
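The sketch below shows how to read the two symptoms above off scikit-learn's classification report and confusion matrix. The toy labels are only placeholders for your own test labels and predictions.

```python
# A minimal sketch: inspect per-class recall/precision and the confusion
# matrix to see which categories are short of data or being confused.
# `y_true` and `y_pred` stand in for your test labels and predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)

y_true = ["cat", "dog", "dog", "bird", "cat", "bird", "dog", "cat"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog", "dog", "dog"]

# Low recall for a class suggests too little (or too narrow) training data
# for it; low precision suggests other classes are leaking into it.
print(classification_report(y_true, y_pred))

# The confusion matrix shows *which* classes are being mixed up.
print(confusion_matrix(y_true, y_pred, labels=["bird", "cat", "dog"]))
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
```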

** When there is little data, use an algorithm suited to little data **

Boosting is an algorithm suited to the case where there is a lot of data. If you want to start by getting results that are at least slightly better than chance from a small amount of data, you could consider an algorithm such as naive Bayes (see the scikit-learn example [Probability calibration of classifiers](http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration.html)). With little training data, the k-nearest neighbor method also seems to give slightly better results than random guessing.
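A minimal sketch of that low-data starting point, using the iris data purely as a stand-in for a small dataset of your own:

```python
# A minimal sketch: with only a handful of labeled samples, simple models
# such as naive Bayes or k-nearest neighbors are often a reasonable first try.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Keep only a small training set on purpose, to mimic the low-data situation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=15, stratify=y, random_state=0)

for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=3)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))
```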

** When the amount of data is small, comparison between algorithms is constrained **

Depending on the combination of the model's degrees of freedom and the number of training samples, it is easy to end up in a situation where the parameters cannot be determined appropriately for those degrees of freedom. This is known as the curse of dimensionality.

** Don't spare the trouble of collecting data **

Once you have gotten the people at the top of your organization to understand the importance of machine learning datasets, do not spare the effort of collecting data. Collecting the data and getting a first version of the machine learning system running is what reveals the problems. Only after collecting the data and running the first version do you finally see what you will need next. Plans made without looking at real data can have large omissions. In a company that focuses on face recognition, the research department may have equipment that captures a face from multiple directions at once, changing the lighting conditions while shooting simultaneously. To obtain a dataset whose variance includes differences in lighting, you need to shoot the same person, with the same facial expression, under various illuminations. By running principal component analysis on such data, you can extract the eigenface components that correspond to differences in lighting. Those components are ones that should not be useful for personal identification. Therefore, the experimental data used to determine the lighting-related eigenface components must span a range of conditions appropriate for the intended use. The face of a person close to 100 years old is clearly different from the face of a 70-year-old. It is worth verifying whether products on the market achieve a sufficient face recognition rate for people close to 100 years old.

An example of eigenfaces appears in "Practical Computer Vision" (Programming Computer Vision with Python). It describes both the case of frontal faces before the face positions have been sufficiently normalized and the case of eigenfaces of the same person after the positions of both eyes have been normalized and aligned. Looking at these differences, it is easy to see that unless you shoot the same person, with the same facial expression, under various illuminations, you will not obtain eigenfaces that strongly reflect differences in lighting. On top of this kind of data collection, it becomes possible to extract components that are not easily affected by lighting. (In other words, if you want the lighting component to appear properly in the principal component analysis, you need to normalize the position of the face. And if you also want the lighting component to be captured properly, the distribution of face shapes needs to be wide enough to reflect the diversity of real faces.)

** Lower the cost of data collection **

It may sound inconsistent with what I said above, but you should also think about ways to reduce the cost of collecting data. Without collecting data, you cannot even tell whether existing classifiers are well trained for your problem.

--Automatic extraction from video
--Collection from the web
--Use of a camera that provides depth information, such as Kinect

Using Kinect and other OpenNI compatible depth sensors

** Exclusion of incorrectly labeled data **

When building a dataset, labeling mistakes happen. If such data is in the training set, it degrades the training result. So make sure you do not have mislabeled data. Depending on the dataset, this is not a trivial matter: an ordinary person cannot tell whether an ultrasound image is a positive image showing a medical condition or a negative image without one. Unless the data is labeled correctly, machine learning will not produce good results. Based on the training results, if there is training data with a low recognition rate, it may be worth checking whether its labels are really correct.
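One practical way to surface labels worth re-checking is to flag samples whose out-of-fold prediction confidently disagrees with the given label. This is a sketch using scikit-learn's digits data as a stand-in for your own features and labels.

```python
# A minimal sketch: flag training samples whose cross-validated prediction
# confidently disagrees with their label, as candidates for label checking.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# Out-of-fold probabilities, so each sample is scored by a model that
# never saw it during training.
proba = cross_val_predict(LogisticRegression(max_iter=2000), X, y,
                          cv=5, method="predict_proba")
pred = proba.argmax(axis=1)
confidence = proba.max(axis=1)

suspicious = np.where((pred != y) & (confidence > 0.9))[0]
print("samples worth re-checking:", suspicious[:20])
```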

** Discrimination performance may change depending on the distribution of the training data **

Depending on the learning algorithm, discrimination performance is determined by the data near the boundary, but in many cases it is determined by the data distribution rather than by the boundary. Therefore, what you include in and exclude from the training data is important learning know-how. Plot the distribution of the scores the trained model gives to the training data, and look at what kind of data is common among the low-scoring samples. If you are learning from image data, writing those images into an HTML file makes it easier to grasp their features. What data you add to the training data determines the success or failure of the learning.
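A minimal sketch of that inspection loop. `model`, `X`, `y`, and `image_paths` are placeholders for your own trained classifier, features, integer class labels, and image file paths.

```python
# A minimal sketch: histogram the trained model's scores on the training data
# and dump the lowest-scoring images into an HTML page for visual inspection.
import matplotlib.pyplot as plt
import numpy as np

def review_low_scores(model, X, y, image_paths, n_show=50,
                      out_html="low_score_samples.html"):
    # Probability assigned to the labeled class of each training sample
    # (assumes y holds integer class indices 0..n_classes-1).
    proba = model.predict_proba(X)
    scores = proba[np.arange(len(y)), y]

    plt.hist(scores, bins=50)
    plt.xlabel("score of the true class")
    plt.ylabel("count")
    plt.show()

    # Write the lowest-scoring images to an HTML file for browsing.
    order = np.argsort(scores)[:n_show]
    with open(out_html, "w") as f:
        f.write("<html><body>\n")
        for i in order:
            f.write(f'<div><img src="{image_paths[i]}" height="128">'
                    f' label={y[i]} score={scores[i]:.3f}</div>\n')
        f.write("</body></html>\n")

# review_low_scores(clf, X_train, y_train, train_image_paths)
```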

** Unintended bias of training data **

If there is an unintended bias in the training data, the resulting model may behave incorrectly. For example, suppose you want to detect a person's head and shoulders, and suppose the training images were collected by having many people stand in front of a wall. If you build a detector from that data, it can turn out to be, in effect, a detector of the wall around the person. Test thoroughly to make sure there is no unintended bias in the learning result. By doing so you can discover such possible biases in the training data and plan and carry out experiments with countermeasures.

** Unintended bias of learning data (additional note) **

For example, when you train a head detector, detection performance may differ depending on hair color. With intensity-based features, much of the training data implicitly assumes that black hair is darker than its surroundings. Then, for shaved heads or gray hair, the detection rate may not be as high as for black hair. Sometimes you need to ask whether something like that is happening in the machine learning you are working on.

** Use other techniques to make data collection easier **

For example, suppose you want to collect training data for dog detection. With a fixed camera, moving objects can be extracted by a method such as background subtraction. The size of each object can be estimated from information such as the camera's mounting angle. Based on that size, you can collect images that are likely to contain dogs, and then select the actual dog images from them, which is much easier than starting from nothing. In addition, if a distance sensor such as Kinect is available, you may be able to select data based on distance information. Even if such technology cannot be used in the final product, make active use of it at the development stage.
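A minimal sketch of the background-subtraction approach with OpenCV. The video path and the size thresholds are placeholders to adapt to your own camera setup; the size filter is only a crude stand-in for the "estimate object size from camera angle" step.

```python
# A minimal sketch: use OpenCV background subtraction on a fixed camera to
# cut out moving objects as candidate training images.
import cv2

cap = cv2.VideoCapture("fixed_camera.mp4")     # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for i, c in enumerate(contours):
        x, y, w, h = cv2.boundingRect(c)
        # Keep only objects whose size is plausible for the target; the
        # thresholds depend on camera placement.
        if 50 < w < 400 and 50 < h < 400:
            cv2.imwrite(f"candidate_{frame_idx:06d}_{i}.png",
                        frame[y:y + h, x:x + w])
    frame_idx += 1
cap.release()
```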

** Use of CG data **

In image-based machine learning, labeling ground-truth data takes a great deal of effort. Depending on the problem, CG data can be used to create labeled ground-truth data [Note 1]. The SYNTHIA Dataset is one such example.

- Generation of training samples by CG and efficient training with MILBoost

- Improving the accuracy of human detection by generating additional training data of human body silhouettes

- [Survey paper] Human detection by statistical learning methods, 5.2 Collection of training samples

SYNTHIA Dataset

The SYNTHetic collection of Imagery and Annotations, is a dataset that has been generated with the purpose of aiding semantic segmentation and related scene understanding problems in the context of driving scenarios. SYNTHIA consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations for 13 classes: misc, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, cyclist, lanemarking.

** Reflect the knowledge of experienced machine learning people in the construction of datasets **

There is accumulated experience in each field of machine learning about what kind of data should be collected. It is worth learning which kinds of data lead to wrong learning. You may have intended to train a person detector but actually trained a detector for the background of the environment where the training data was collected. You may want to make a lean data acquisition plan, but you still need to collect data, train, evaluate the results, and think about how to improve, and then improve the data collection itself based on the knowledge gained in that process.

** Importance of negative samples **

How to collect negative samples and how to ensure their quality are also important issues. Take pedestrian detection as an example. To judge that non-pedestrians are not pedestrians in the images a car may see while driving, roadside trees must be trained as non-pedestrians. Mailboxes are not pedestrians. Signboards are not pedestrians. The ends of guardrails are not pedestrians. Neither pedestrian crossings nor road markings are pedestrians. The walls of buildings are not pedestrians. Training must ensure that none of these non-pedestrians are falsely detected. To realize a product-level detector, it is necessary to collect enough such negative samples and include them in the training. Mobileye provides monocular pedestrian detection to various vehicles through this kind of training.
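One common way to accumulate effective negative samples is hard negative mining: run the current detector over images known to contain no pedestrians and keep its false detections as new negatives for the next training round. The sketch below is generic (not a description of any particular vendor's process); `detector`, its `detect` method, and the directory names are placeholders.

```python
# A minimal sketch of hard negative mining: anything the current detector
# finds in pedestrian-free images is a false positive, so save it as a
# negative sample for retraining. `detector.detect` is a placeholder API.
from pathlib import Path
import cv2

def mine_hard_negatives(detector, negative_dir="no_pedestrian_images",
                        out_dir="hard_negatives"):
    Path(out_dir).mkdir(exist_ok=True)
    count = 0
    for path in sorted(Path(negative_dir).glob("*.jpg")):
        image = cv2.imread(str(path))
        for (x, y, w, h) in detector.detect(image):
            cv2.imwrite(f"{out_dir}/{path.stem}_{count}.png",
                        image[y:y + h, x:x + w])
            count += 1
    return count
```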

Negative samples are very important.

** Notes for cascade classifiers **

In the case of a Haar-like cascade classifier, the later the strong classifier in the cascade, the more doubtful the generalization of the learning result tends to be.

Positive samples mixed into the negative set can make the learning result strange. As a result, things that should be detected easily end up not being detected. It is therefore worth visually checking the samples used for the later-stage strong classifiers for positive samples mixed into the negatives. By doing so, you can prevent the learning result from going wrong because of positive samples mixed into the negative set.

** Is the test data really adequate as test data? **

--It is essential that the test data for machine learning does not include the data used for training.
--If the data was augmented by processing, do not split the processed data between training and testing. Samples derived from the same original image would end up in both sets, so the test results would look better than they really are.
--If you use a dataset that contains face images of celebrities, there is a risk that different frontal face images of, say, a president end up in both the training and the test sets.
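One way to enforce this is to split by the original source image rather than by sample, so that all derivatives of one original stay on the same side. This is a sketch with synthetic data; `X`, `y`, and `source_ids` stand in for your own samples, labels, and original-image identifiers.

```python
# A minimal sketch: split augmented data by the original source image so that
# derivatives of the same original never appear in both training and testing.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))             # 100 originals x 3 augmented copies
y = rng.integers(0, 2, size=300)
source_ids = np.repeat(np.arange(100), 3)  # id of the original image

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=source_ids))

# No original image contributes to both sides of the split.
assert not set(source_ids[train_idx]) & set(source_ids[test_idx])
```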

** Let's understand the purpose of use **

You need to start by understanding how the function you implement will be used: what the impact of the classifier's failures is, and how it should therefore be implemented. Think about whether other auxiliary methods can compensate for it. This holds for machine learning in general.

SlideShare Code design to avoid crying with machine learning

If you have enough data and want to provide the training data externally to have an algorithm developed, one option is the site Kaggle. Some competitions offer prize money, while others are recruiting competitions.

Why data scientists from around the world gather at Kaggle, from bounty hunters to job hunting

I tried Kaggle even though I can't speak English

The article below points out a problem with binary classification when an image that belongs to neither class comes in (like tomato juice being fed into a blood type classifier). A system that includes machine learning has to be designed as a whole, including what kind of system fits the purpose of use and what to do in the preprocessing immediately before the classifier.

Deep Learning recognizes your boss and hides the screen

** Note: Is it something a person could learn? **

In image-based machine learning, a useful intuition about whether something can be learned is whether a person could learn it. Consider age estimation and face recognition as examples. I find it difficult to estimate the age of, or to tell apart, people of an ethnicity or race I rarely see. I think this is because the average face, the factors of facial variation, and the principal components differ between populations. This is true even for faces, where the research and development effort has been anything but trivial. Remember that machine learning in other fields tends to run short of data, and that "garbage in, garbage out" applies.

[Note 1]: Even in the field of stereo measurement, CG has become impossible to ignore. It is not easy to prepare stereo images with measured ground truth. Therefore, it seems that stereo images are generated from 3D CG, for which the ground truth is known, and algorithms are evaluated on them.

Let's take a closer look at the data

Looking closely at the data and observing it leads to noticing things you would miss if you looked only at the results. The following articles are such examples.

Deep Learning! The story of the data itself that is read when it does not follow after handwritten number recognition

(17) Automatic recognition of autographed number images with cuda-convnet

Shinsai FaxOCR Handwriting recognition dataset Test data (MNIST IDX format)


Postscript: I feel that dlib's facial landmark detection is excellent. Using its results makes it easier to normalize face images. Unless you are trying to build something with higher accuracy than dlib can achieve, I feel that using such tools is effective. Until now, the difficulty of having to create a large number of ground-truth eye positions meant that the development of face algorithms could only be done in a limited number of places.
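A minimal sketch of that normalization step with dlib. It assumes the standard 68-point predictor file distributed by dlib has been downloaded; the image path is a placeholder.

```python
# A minimal sketch: detect facial landmarks with dlib and use the eye centers
# to rotate the face so the eyes are horizontal (a simple normalization).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):
    shape = predictor(gray, rect)
    pts = np.array([(p.x, p.y) for p in shape.parts()])
    left_eye = pts[36:42].mean(axis=0)     # 68-point landmark convention
    right_eye = pts[42:48].mean(axis=0)

    dy = right_eye[1] - left_eye[1]
    dx = right_eye[0] - left_eye[0]
    angle = np.degrees(np.arctan2(dy, dx))
    eyes_center = (left_eye + right_eye) / 2
    M = cv2.getRotationMatrix2D((float(eyes_center[0]), float(eyes_center[1])),
                                angle, 1.0)
    aligned = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    cv2.imwrite("face_aligned.jpg", aligned)
```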


Let's read the papers.

In papers, to claim that an image recognition technique is better than others, it is always compared on some database. When training is involved, the paper should describe what kind of database it was trained on.

Most of the time a public database is used. In rare cases, data acquired by the authors themselves is used for training and evaluation. Even then, the identity of the data is usually stated, or the data is released.

So by reading recent papers you can find out what kinds of databases are currently used for comparison.

You can also read papers for that purpose.


Postscript:

What you should not do when creating and using machine learning datasets

--Not keeping evaluation data and training images separate

If the data is not managed properly, evaluation images may end up being used as training images. Once an image has been copied from somewhere for training and its origin has become unclear, it is difficult to trace the source. As soon as you obtain the data, manage it separately for evaluation and for training, and declare that split early, before letting other members of the team use the data.

--Making the evaluation data and the training data too similar

Suppose you cut still images out of video frames and shuffle them into evaluation and training sets. Because the shooting environment is the same, if the time interval between the extracted frames is too short, the images will be too similar. In that situation, the difference in characteristics between the evaluation images and the training images disappears, and the evaluation comes out overly good.

However, as soon as the system faces the real environment, the expected performance is not achieved because of its variability.

The problem of evaluation images being too similar to training images can also creep in unintentionally. When building a face matching system, there is a concern that the training data includes images very similar to those used for evaluating face matching without anyone noticing. For example, face images of Barack Obama (the former president) easily end up in face image databases, so even if different datasets are used for training and evaluation, Barack Obama can easily be included in both.


Postscript: People are often detected against a green background on soccer fields, so if you train specifically on that, the detection rate may drop when the background is not green.

When using your own data, it is essential to check its validity. If the quality of the annotations is not improved, the accuracy of the learning may not improve either.

I have heard that one of the reasons face matching technology has improved dramatically in recent years is that large-scale face matching databases have been built and that data now ensures the quality of the face-to-name mapping.

A Dataset With Over 100,000 Face Images of 530 People

A good dataset will live longer than an individual algorithm implementation.

Between face detection / person detection algorithms and the databases used for face detection / person detection, the databases last longer.

The Viola-Jones face detection algorithm was published in 2001, and many cascade detectors inspired by it were implemented. Since then, with the progress of deep learning, face detection algorithms have changed significantly. Not only the software algorithms but also the hardware mechanisms for face detection have changed significantly. Recently, various hardware implementations that accelerate deep learning frameworks have appeared, so the hardware for face detection and person detection has also changed significantly.

Nevertheless, some face and person detection datasets are older than the Viola-Jones paper.

Well-crafted datasets remain useful for longer than algorithms. So it is important to grow a machine learning dataset for your own field.

Keep in mind that the implementation you are working on now will eventually have to be replaced by another one. Even with deep learning, there will be a need to move to different network models and implementations. It is important to carefully grow the dataset in preparation for that time.

Creating a good dataset is setting a good problem

The most important thing in solving a problem is how you set the problem. If you set the problem badly, the problem becomes troublesome. Setting a good problem is the key point in solving it. A good dataset is one built with thought about what the problem is and how to get closer to a solution.

Example: LFW

Its purpose was to scale up face matching databases and to build a face matching database whose images, including images taken under varying conditions, are closer to the real environment.

Example: VGG Face2

It covers a variety of face orientations, including half profiles, and contains more faces than LFW. The images have been manually cleaned.

If your problem setting is reasonable, prepare an appropriate dataset. To pose a good problem, publish an appropriate dataset. Even developers in companies may be able to provide good datasets.

Example: pose the problem on Kaggle.

https://www.kaggle.com/c/cvpr-2018-autonomous-driving

By looking at the databases used in these competitions, you can see what current problem settings look like.

Object recognition based on segmentation is important, as is being able to process in real time.

This shows how things have changed from early pedestrian detection, which was based on rectangular detection boxes.

https://www.kaggle.com/c/mercari-price-suggestion-challenge
Competitions are also posed by Japanese companies.

Example: Publish the database.

Daimler Pedestrian Benchmark Data Sets

Related articles:
--Search for annotation: annotation tools (ground-truth input tools) are evolving
--How a sloppy person manages experimental data
--Thoughts on each stage of collecting data for machine learning
--Do not use the apparent ratio of the training data as it is

Should undetected data be added, or set aside for later?
