[PYTHON] How to collect machine learning data

Recently, more and more articles about the problem of collecting data have been appearing on the web. I want to search them out, study them, and learn from them.

This is a collection of links to articles that may be helpful.

-How to create a face image data set used in machine learning (1: Acquire candidate images using WebAPI service)

-Publish the know-how of creating a similar image search service for AV actresses by deep learning by chainer

-How to increase the number of machine learning dataset images

-Deep learning to determine if you have big breasts from your face photo (it works or not)

-Learning data set 2 that can be used for extracting feature points of face images

-Learning data set that can be used for extracting feature points of face images (updated from time to time)
--Manual feature point annotation of this kind is important when first training a facial feature point extractor.
--However, once a method exists that derives facial feature points with sufficient accuracy, that method should be put to use.
--Currently, feature point (landmark) derivation libraries such as dlib perform well, so in most cases the library can replace the manual work.

-Nekoto Image Processing part 1-Material Collection

-How to put OpenCV in Raspberry Pi and easily collect images of face detection results with Python

--Ryan Mitchell, Translated by Toshiaki Kurokawa, Technical Supervision by Takeshi Shimada "Web Scraping with Python"

--Toby Segaran, Translated by Hitoshi Toyama and Masao Kamozawa "Collective Intelligence Programming"

--Interface July 2016 issue [From how to make the most difficult learning database to Raspberry Pi 1, 2, 3 recognition test: Learning and recognition of the target fish "Nabeka"](http://www.kumikomi.net/interface/contents/201607.php)

** Manual input **

According to reports, data for machine learning is often created manually. When the purpose is clear and the investment in the work is expected to pay off, companies hire a large number of people for it. I hear that they keep adding manually entered data and making improvements.

** Importance of negative samples **

In pedestrian detection, images of roads and streets that contain no pedestrians are very important. For an in-vehicle camera, the data should have the angle of view actually seen from the vehicle. To train a pedestrian detector with Boosting, you need a large number of images that contain no people. With Cascade-type classifiers, the later the stage, the higher the proportion of confusing images among the negatives. If an image that actually contains a person slips in at that point and is used as a negative, the performance of the detector drops significantly. The later the stage, the more the training result also depends on the particular training set (both positive and negative images). (Addition: Do few people use Boosting these days? Either way, the importance of negative samples remains the same.)
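The way later cascade stages end up with harder negatives can be sketched as follows. This is only an illustration, not the training routine of any particular library: `mine_hard_negatives`, `simulate_cascade`, the scoring functions, and the toy numbers are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def mine_hard_negatives(
    negatives: Iterable[T],
    score_fn: Callable[[T], float],
    threshold: float,
) -> List[T]:
    """Return the negatives that the current stage wrongly accepts.

    Every item in `negatives` is assumed to contain no target object,
    so any item scoring at or above `threshold` is a false positive.
    Those false positives become the negative set for the next stage.
    """
    return [sample for sample in negatives if score_fn(sample) >= threshold]


def simulate_cascade(
    negatives: Sequence[T],
    stage_score_fns: Sequence[Callable[[T], float]],
    threshold: float,
) -> List[int]:
    """Report how many negatives survive after each stage."""
    survivors = list(negatives)
    counts = []
    for score_fn in stage_score_fns:
        survivors = mine_hard_negatives(survivors, score_fn, threshold)
        counts.append(len(survivors))
    return counts


if __name__ == "__main__":
    # Toy "images": one number standing in for how person-like a patch looks.
    pool = [0.05, 0.25, 0.45, 0.55, 0.75, 0.95]
    # Hypothetical stages; each later stage is stricter than the previous one.
    stages = [lambda x: x + 0.3, lambda x: x + 0.1, lambda x: x - 0.1]
    print(simulate_cascade(pool, stages, threshold=0.5))
```

Because only the most confusing negatives reach the late stages, a single mislabeled image that really contains a person carries a lot of weight there, so it is worth checking the surviving negatives by eye before training the next stage.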

For example, when building a dog face detector, it is not certain that collecting many dog faces that an existing detector can already detect will improve the detector. The facial shapes of a Shiba Inu and a Bulldog are too different, and I doubt that a bulldog's face can be detected after collecting only Shiba Inu faces. Being able to detect one kind of face does not mean another kind can be detected. It is therefore risky to try to improve a detector using only images that an existing detector already detects. You should find a way to use images that existing detectors miss, for example by taking the tracking results at later times in a scene where the dog's face was detected once. (I would like to know how this plays out in deep learning.) It is claimed that deep learning can authenticate a person from a profile view by matching against a database of frontal faces.

High matching rate even with a face facing sideways, sunglasses and masks Panasonic achieves world-class face matching with deep learning
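The idea of using tracking results to collect images that an existing detector misses can be sketched like this. It is a minimal sketch under assumptions: the tracker and detector outputs are taken as already computed, and the names `frames_missed_by_detector`, `tracked`, `detected`, and the default `min_iou` of 0.5 are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def frames_missed_by_detector(
    tracked: Dict[int, Box],
    detected: Dict[int, List[Box]],
    min_iou: float = 0.5,
) -> List[Tuple[int, Box]]:
    """Return (frame index, tracked box) pairs the detector failed to find.

    `tracked` maps a frame index to the box a tracker propagated from an
    earlier successful detection. `detected` maps a frame index to the
    boxes the existing detector returned on that frame.
    """
    missed = []
    for frame, box in sorted(tracked.items()):
        candidates = detected.get(frame, [])
        # A frame counts as missed when no detection overlaps the tracked box.
        if not any(iou(box, d) >= min_iou for d in candidates):
            missed.append((frame, box))
    return missed
```

The returned boxes are candidates for new positive samples. Trackers drift, so they should still be reviewed manually before being added to the training set.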

You can also use YOLO to detect many types of objects in videos. Even with some false detections, a high detection speed is convenient as long as you plan to select the results manually afterwards.

** Detector that can learn with few images **

The HOG + SVM detector in dlib can be trained as an object detector from very few positive examples. Surprisingly, this is very different from the Haar Cascade detector.

Machine learning with dlib to detect objects
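As a rough sketch of what training such a detector involves: dlib's trainer reads an imglab-style XML file listing images and their boxes. The helper names `build_training_xml` and `train_detector` and the option values below are assumptions for illustration. The dlib calls follow dlib's published Python example for object detector training, and I believe image paths in the XML are resolved relative to the XML file, but check both against the version you install.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# (left, top, width, height) in pixels
Rect = Tuple[int, int, int, int]


def build_training_xml(annotations: Dict[str, List[Rect]]) -> str:
    """Serialize annotations into imglab-style XML for dlib's trainer."""
    dataset = ET.Element("dataset")
    images = ET.SubElement(dataset, "images")
    for filename, rects in annotations.items():
        image = ET.SubElement(images, "image", {"file": filename})
        for left, top, width, height in rects:
            ET.SubElement(image, "box", {
                "top": str(top),
                "left": str(left),
                "width": str(width),
                "height": str(height),
            })
    return ET.tostring(dataset, encoding="unicode")


def train_detector(xml_path: str, output_path: str) -> None:
    """Train a HOG + SVM detector with dlib (requires dlib to be installed)."""
    import dlib  # imported lazily so the XML helper works without dlib

    options = dlib.simple_object_detector_training_options()
    # Mirroring doubles the data, but only makes sense for symmetric objects.
    options.add_left_right_image_flips = True
    # SVM regularization strength; a value to tune, not a recommendation.
    options.C = 5
    dlib.train_simple_object_detector(xml_path, output_path, options)
```

A typical flow would be to write the string returned by `build_training_xml` to a file next to the images, call `train_detector` on it, and load the result with `dlib.simple_object_detector(output_path)`.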

** Using existing detectors **

When collecting training data for hardware development, it is also possible to collect the data using a software version of the detector.

reference: Importance of machine learning datasets

CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset. Each consists of 60,000 color images of size 32x32.

[Python] How to read CIFAR-10, CIFAR-100 data
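For reference, a minimal sketch of reading the "python version" of CIFAR-10. Each batch file is a pickled dictionary whose `data` entry holds one 3072-value row per image: 1024 red values, then 1024 green, then 1024 blue, each plane in row-major order. The function names `load_batch` and `row_to_pixels` are made up for this example.

```python
import pickle
from typing import Any, Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def load_batch(path: str) -> Dict[bytes, Any]:
    """Unpickle one batch file from the 'python version' archive.

    The real files store `data` as a NumPy array, so NumPy must be
    installed for unpickling them to succeed. Keys come back as bytes,
    for example b"data" and b"labels".
    """
    with open(path, "rb") as f:
        return pickle.load(f, encoding="bytes")


def row_to_pixels(row: Sequence[int], size: int = 32) -> List[List[Pixel]]:
    """Convert one flat row (R plane, G plane, B plane) to rows of RGB pixels."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(row)}")
    return [
        [
            (
                row[y * size + x],              # red
                row[plane + y * size + x],      # green
                row[2 * plane + y * size + x],  # blue
            )
            for x in range(size)
        ]
        for y in range(size)
    ]
```

With NumPy available, the same rearrangement for a whole batch is usually written as `data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)`. The pure-Python version above is only meant to make the layout explicit.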

Postscript:

Model Zoo provides various trained models. If you build a detector from one of them, you can sample images and automatically generate annotations within the range of what the model has learned.

--Automatically generate annotation files for images using an existing detector from Model Zoo.
--Set the detector threshold loosely, and collect images that are slightly outside the original learning range.
--From the collected data, pick out what the detector you want to build should detect, and annotate it by your own rules.
--Run the training program on both the automatically generated annotations and the data that extends the learning range by your own rules.
--Repeat image sampling and automatic annotation using the trained result.
--Train again.

By collecting data in this way, you can build up a starting set of training data for the detector you want to make.
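The "loose threshold" step above can be sketched as a simple split of detector output. Everything here is an assumption for illustration: the `Detection` record, the function name `split_detections`, and the two thresholds are not part of any Model Zoo API.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    image: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
    score: float


def split_detections(
    detections: List[Detection],
    loose_threshold: float,
    strict_threshold: float,
) -> Tuple[List[Detection], List[Detection]]:
    """Split detections into auto-annotations and manual-review candidates.

    Scores at or above `strict_threshold` are trusted as automatic
    annotations. Scores between the two thresholds lie near the edge of
    what the existing detector has learned, so they are set aside for
    annotation by your own rules. Lower scores are dropped.
    """
    if loose_threshold > strict_threshold:
        raise ValueError("loose_threshold must not exceed strict_threshold")
    auto, review = [], []
    for det in detections:
        if det.score >= strict_threshold:
            auto.append(det)
        elif det.score >= loose_threshold:
            review.append(det)
    return auto, review
```

After retraining on both groups, the new detector takes the place of the existing one and the split is run again on freshly sampled images.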

** Pedestrian-related datasets **

A database of pedestrians with segmentation annotations. It is available for non-commercial purposes only and is useful for training and evaluating pedestrian detectors.

** Face-related datasets **

Face Database

FDDB: Face Detection Data Set and Benchmark

https://github.com/StephenMilborrow/muct#the-muct-face-database

As a negative dataset http://cocodataset.org/#home

Link collections:

Computer Vision Datasets

CVonline: Image Databases

Yet Another Computer Vision Index To Datasets (YACVID)

60 Facial Recognition Databases

Let's examine the training data used in papers.

Most papers state where the data used for training in their implementation came from. Reading through them will lead you to the data.


Postscript:

In face detection and human detection, there are open source implementations with reasonable accuracy. There is no reason not to use them to create a training dataset and a detector for your own purposes. If you expand the training data so that its proportions are close to those of your intended use, you are likely to get closer to a detector that covers your purpose.
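One way to make "proportions close to your intended use" concrete is to compute how many samples each category still lacks. This is a sketch under assumptions: the function name `samples_to_add`, the category names, and the integer weights are hypothetical, and the policy of never discarding existing data is just one possible choice.

```python
import math
from fractions import Fraction
from typing import Dict


def samples_to_add(counts: Dict[str, int], weights: Dict[str, int]) -> Dict[str, int]:
    """Return how many samples each category still needs.

    `weights` are integer proportions of the intended use, for example
    {"frontal": 5, "profile": 3, "occluded": 2}. Nothing is discarded:
    the target is the smallest dataset that contains all existing
    samples and matches the proportions.
    """
    if set(counts) != set(weights):
        raise ValueError("counts and weights must cover the same categories")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    # Size of one weight unit such that no category has to shrink.
    # Fraction keeps the arithmetic exact, so ceil never overshoots.
    scale = max(Fraction(counts[c], weights[c]) for c in counts)
    return {c: math.ceil(scale * weights[c]) - counts[c] for c in counts}


if __name__ == "__main__":
    current = {"frontal": 800, "profile": 100, "occluded": 100}
    intended = {"frontal": 5, "profile": 3, "occluded": 2}
    print(samples_to_add(current, intended))
```

The output tells you where to direct collection effort, for example by running the open source detector on footage that is rich in the underrepresented categories.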

SlideShare SSII2018TS: Large-scale Deep Learning

-Concept of each stage of collecting data for machine learning

-It is not a good idea to use the ratio of training data as it appears.

-How machine learning datasets are lost

-How a sloppy person manages experimental data
