[PYTHON] The IMDB-WIKI dataset for estimating age and gender from facial images

Introduction

Data is the lifeblood of machine learning, especially deep learning. This article introduces the IMDB-WIKI dataset, which can be used to learn the task of estimating age and gender from facial images. It goes as far as shaping the data for training. Next time, I would like to train a CNN for age and gender estimation.

The code is available at https://github.com/yu4u/age-gender-estimation

IMDB-WIKI dataset

This dataset was created by crawling the Internet Movie Database (IMDb; an online database of actors in movies and TV shows) and Wikipedia. It consists of profile images, images of the facial areas extracted from the profile images, and metadata about the people. IMDb contributes 460,723 facial images and Wikipedia contributes 62,328. The winning team in the ChaLearn apparent age estimation challenge, a competition to estimate age from images, used this dataset for pre-training (followed by fine-tuning on the competition training data).

Get dataset

The original images are very large, so download the archives that contain the extracted facial areas together with the metadata.

#IMDb data
wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar
#Wikipedia data
wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/wiki_crop.tar
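
If you would rather unpack a downloaded archive from Python, the following is a minimal sketch using only the standard library. The function name extract_archive and the destination directory are assumptions made for illustration; they do not come from the original code.

```python
import tarfile
from pathlib import Path


def extract_archive(tar_path, dest_dir):
    """Unpack a downloaded archive and return the .mat metadata files found in it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        # Only unpack archives that come from a source you trust.
        archive.extractall(dest)
    # Search the extracted tree for MATLAB metadata files.
    return sorted(dest.rglob("*.mat"))
```

For example, extract_archive("wiki_crop.tar", "data") would unpack the Wikipedia archive into a data directory and return the path of the metadata file it contains.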

All the metadata is stored in the .mat file inside each archive (wiki.mat for Wikipedia). Although it is MATLAB data, it can be read with SciPy as follows.

import scipy.io
meta = scipy.io.loadmat("wiki.mat")

The data included is as follows.

- dob: Date of birth
- photo_taken: The year the photo was taken
- full_path: Image path
- gender: Gender
- name: Name
- face_location: Rectangle information of the face area
- face_score: Face detection score
- second_face_score: Second largest face detection score

These can be obtained as follows.

db = "wiki"  # top-level key of wiki.mat (assumed to be "imdb" for the IMDb data)
full_path = meta[db][0, 0]["full_path"][0]
dob = meta[db][0, 0]["dob"][0]
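
The same indexing pattern can be applied to every field at once. The following is a minimal sketch, assuming that each field is stored with the same layout as full_path and dob; the helper names unpack_fields and load_meta are hypothetical and introduced here only for illustration.

```python
FIELDS = ("dob", "photo_taken", "full_path", "gender", "name",
          "face_location", "face_score", "second_face_score")


def unpack_fields(meta, db, fields=FIELDS):
    """Return {field_name: values} using the meta[db][0, 0][name][0] pattern."""
    record = meta[db][0, 0]
    return {name: record[name][0] for name in fields}


def load_meta(mat_path, db):
    """Load a .mat file with SciPy and unpack its fields."""
    # Imported lazily so that unpack_fields can be used without SciPy installed.
    import scipy.io
    return unpack_fields(scipy.io.loadmat(mat_path), db)
```

With this in place, load_meta("wiki.mat", "wiki")["dob"] would give the same array as the dob line above.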

Age calculation

The age of the person in the image is not directly included in the metadata and must be calculated. The basic approach is to subtract the date of birth (dob) from the year the photo was taken (photo_taken), but a few details need care. First, dob is stored as a MATLAB serial date number and cannot be used until it is converted to a date. Python can handle a similar format, but there is a trap (http://sociograph.blogspot.jp/2011/04/how-to-avoid-gotcha-when-converting.html): MATLAB counts the days from January 1 of year 0, whereas Python counts the days from January 1 of year 1. The conversion therefore has to subtract 366 days, as follows (year 0 is a leap year!).

from datetime import datetime
python_datetime = datetime.fromordinal(int(matlab_serial_date_number) - 366)
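
To check the 366-day offset against known dates, here is a small worked sketch. The function name matlab_datenum_to_datetime and the constant name are assumptions for illustration; the handling of the fractional part (time of day) goes slightly beyond what this dataset needs.

```python
from datetime import datetime, timedelta

# Gap between MATLAB's day 1 (Jan 1 of year 0) and Python's day 1 (Jan 1 of year 1).
MATLAB_TO_PYTHON_OFFSET = 366


def matlab_datenum_to_datetime(datenum):
    """Convert a MATLAB serial date number to a Python datetime."""
    days = int(datenum)
    fraction = float(datenum) - days  # time of day, if any
    return (datetime.fromordinal(days - MATLAB_TO_PYTHON_OFFSET)
            + timedelta(days=fraction))


print(matlab_datenum_to_datetime(730486))    # 2000-01-01 00:00:00
print(matlab_datenum_to_datetime(719529.5))  # 1970-01-01 12:00:00
```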

Also, photo_taken contains only the year. Taking the middle of the year as the expected date on which the photo was taken, the age is calculated as shown below.

from datetime import datetime

def calc_age(taken, dob):
    # clamp to 1 so that very small dob values do not raise an error
    birth = datetime.fromordinal(max(int(dob) - 366, 1))

    # assume the photo was taken in the middle (Jul. 1) of the year
    if birth.month < 7:
        return taken - birth.year
    else:
        return taken - birth.year - 1
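
A short worked example may make the two branches clearer. This is only a sketch: calc_age is repeated so that the snippet runs on its own, and to_matlab_datenum is a hypothetical helper used to build test values, not something found in the dataset or the original code.

```python
from datetime import date, datetime


def calc_age(taken, dob):
    # Same logic as above, repeated so this snippet runs on its own.
    birth = datetime.fromordinal(max(int(dob) - 366, 1))
    if birth.month < 7:
        return taken - birth.year
    return taken - birth.year - 1


def to_matlab_datenum(d):
    """Hypothetical helper: build a MATLAB-style serial number from a Python date."""
    return d.toordinal() + 366


spring = to_matlab_datenum(date(1980, 3, 15))
autumn = to_matlab_datenum(date(1980, 9, 15))

print(calc_age(2010, spring))  # 30: the birthday falls before July 1
print(calc_age(2010, autumn))  # 29: the birthday falls after July 1
print(calc_age(2010, 0))       # 2009: a dob of 0 is clamped to year 1
```

The last call shows how an invalid dob turns into an implausible age, which is one reason the age range has to be checked afterwards.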

Cleaning

Because the data was collected by crawling, it contains a lot of noise that needs to be removed. For example, the Wikipedia database includes the following.

- There are 18016 cases where face_score is -inf.
- There are 4096 cases where second_face_score is not nan. If an image contains more than one face, the metadata may belong to the second face.
- There are cases where the calculated age is negative or greater than 100.

Therefore, it is necessary to keep only the records whose face_score is above a certain threshold, whose second_face_score is nan, and whose age is in the range of 0 to 100.
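
The three conditions can be sketched as a single predicate. This is a minimal illustration using only the standard library; the function name is_clean, the default threshold min_score=1.0, and the sample records are assumptions, not values taken from the dataset.

```python
import math


def is_clean(face_score, second_face_score, age, min_score=1.0):
    """Return True if a record passes the three checks described above."""
    if not face_score >= min_score:        # also rejects -inf and nan
        return False
    if not math.isnan(second_face_score):  # a second face was detected
        return False
    return 0 <= age <= 100


records = [
    # (face_score, second_face_score, age)
    (4.2, float("nan"), 35),            # kept
    (float("-inf"), float("nan"), 35),  # no usable face detection
    (4.2, 1.3, 35),                     # second face present
    (4.2, float("nan"), 134),           # implausible age
    (4.2, float("nan"), -3),            # negative age
]
kept = [r for r in records if is_clean(*r)]
print(len(kept))  # 1
```

The threshold is a trade-off: a higher min_score gives cleaner faces but leaves fewer training images.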

Data set details

Finally, let's take a look at the contents of the Wikipedia dataset. Details are in the following notebook: https://github.com/yu4u/age-gender-estimation/blob/master/check_dataset.ipynb

Distribution of face_score, excluding -inf (face_score.png)

Distribution of second_face_score, excluding nan (second_face_score.png)

Examples of images with face_score greater than 5 (high_score.png)

Examples of images with face_score from 0 to 1 (low_score.png)

Examples of images where face_score is -inf (zero_score.png)

Examples of images with a calculated age greater than 100 (age_100.png)

Examples of images with a negative calculated age (age_0.png)
