[PYTHON] Similar face image detection using face recognition and PCA and K-means clustering

Introduction

Similar image detection is one of the most commonly used features in image recognition. Recommendation and search systems often handle tens or hundreds of thousands of images. Depending on the image size and the comparison method, finding a similar image among that many candidates takes a large amount of processing time. This article therefore considers a way to detect similar images while reducing both the amount of data and the number of comparisons, using k-means and PCA.

Face features are extracted with the face_recognition library, which represents each face as a 128-dimensional vector (a face encoding; the sample code stores these vectors in a variable named face_landmarks). The library is available at the following URL. https://github.com/ageitgey/face_recognition
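To make the 128-dimensional vectors concrete, here is a minimal sketch of how two encodings can be compared by Euclidean distance. The vectors are random stand-ins rather than real face encodings, and `nearest_encoding` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def nearest_encoding(known, query):
    """Return (index, distance) of the row in `known` closest to `query`."""
    # Euclidean distance from the query to every known encoding
    distances = np.linalg.norm(known - query, axis=1)
    best = int(np.argmin(distances))
    return best, float(distances[best])

rng = np.random.default_rng(0)
# Assumption: random 128-d vectors stand in for real face encodings
known = rng.normal(size=(5, 128))
# A slightly perturbed copy of row 3 plays the role of "the same face again"
query = known[3] + 0.01 * rng.normal(size=128)

index, dist = nearest_encoding(known, query)
print(index, dist)
```

With real data, `known` would hold one encoding per image and `query` the encoding of the image you are searching for.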

The number of dimensions after PCA was set to 20 after checking the contribution rate (explained variance ratio). After PCA reduces the dimensionality, k-means groups the images into K = 10 clusters (the sample code below, which runs on a 100-image subset, uses K = 8). To find similar images, first compute the distance from the query image to the centroid (center of gravity) of each cluster and pick the nearest centroid, then compute distances only to the images in that cluster. Storing the PCA-reduced features instead of the original vectors also reduces the storage required.
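The two-stage search described above can be sketched as follows. This is an illustration under stated assumptions, not the article's own implementation: the encodings are random stand-ins, `find_similar` is a hypothetical helper, and `svd_solver`, `n_init`, and `random_state` are set only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Assumption: random 128-d vectors stand in for real face encodings
encodings = rng.normal(size=(1000, 128))

# Stage 0 (offline): reduce to 20 dimensions and cluster into K = 10 groups
pca = PCA(n_components=20, svd_solver="full")
reduced = pca.fit_transform(encodings)
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

def find_similar(query_encoding, top_n=5):
    """Search only the cluster whose centroid is nearest to the query."""
    # Project the 128-d query into the 20-d PCA space
    q = pca.transform(query_encoding.reshape(1, -1))[0]
    # Stage 1: compare against the K centroids
    centroid_dist = np.linalg.norm(kmeans.cluster_centers_ - q, axis=1)
    nearest_cluster = int(np.argmin(centroid_dist))
    # Stage 2: compare only against the members of that cluster
    members = np.flatnonzero(labels == nearest_cluster)
    member_dist = np.linalg.norm(reduced[members] - q, axis=1)
    order = np.argsort(member_dist)[:top_n]
    return members[order], member_dist[order]

indices, distances = find_similar(encodings[0])
print(indices)
print(distances)
```

Because the query here is itself one of the stored images, it comes back as its own nearest match; in practice you would skip that entry or query with an image that is not in the collection.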

With 1000 images and K = 10, k-means clustering reduces the search to about 110 distance computations on average instead of 1000: 10 comparisons against the cluster centroids plus about 100 against the images in the nearest cluster, assuming clusters of roughly equal size. In addition, PCA reduces each vector from 128 to 20 dimensions, so each distance computation is cheaper as well.
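The arithmetic behind this estimate can be written out as a short sketch. It assumes equally sized clusters, counts one multiply-add per dimension per distance, and adds the cost of projecting the query with PCA, which is an assumption of this sketch rather than something the estimate above includes. `expected_comparisons` is a hypothetical helper name.

```python
def expected_comparisons(n_images, n_clusters):
    """Average distance computations per query, assuming equally sized clusters."""
    centroid_checks = n_clusters                # one per cluster centroid
    in_cluster_checks = n_images / n_clusters   # images in the nearest cluster
    return centroid_checks + in_cluster_checks

n_images, n_clusters = 1000, 10
full_dim, reduced_dim = 128, 20

comparisons = expected_comparisons(n_images, n_clusters)
print(comparisons)  # 110.0

# Rough cost in multiply-adds: each distance touches every dimension once.
brute_force_cost = n_images * full_dim
# Assumption: the query is first projected by PCA (full_dim * reduced_dim).
clustered_cost = full_dim * reduced_dim + comparisons * reduced_dim
print(brute_force_cost, clustered_cost)
```

Under these assumptions the estimate is 128,000 multiply-adds per query for the exhaustive search versus 4,760 for the clustered search. Real clusters are rarely balanced, so the actual saving depends on the data.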

The sample code below uses face images from the publicly available CelebA dataset. http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

Program


# coding:utf-8
import glob
import face_recognition
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import numpy as np

# --------------------------------
# 1. Load the images and compute face encodings
# --------------------------------
# Note: face_recognition detects faces internally, so the separate dlib
# detector and landmark predictor (and their imports) are not needed here.

images = glob.glob('./faces/*.jpg')
images = sorted(images)[:100]

face_landmarks = []
face_filepaths = []

for filepath in images:
    # Load the image to analyze
    img = face_recognition.load_image_file(filepath)

    # Compute a 128-dimensional encoding for each detected face;
    # keep only the first face and skip images with no detectable face
    face_encodings = face_recognition.face_encodings(img)
    if len(face_encodings) > 0:
        face_filepaths.append(filepath)
        face_landmarks.append(face_encodings[0])

# Reduce the 128-dimensional encodings to 20 principal components
pca = PCA(n_components=20)

# Fit PCA and convert the dataset to principal components in one step
transformed = pca.fit_transform(face_landmarks)

#Plot the principal components
# plt.subplot(1, 2, 2)
plt.scatter(transformed[:, 0], transformed[:, 1])
plt.title('principal component')
plt.xlabel('pc1')
plt.ylabel('pc2')

# Print the contribution rate (explained variance ratio) of each principal component
print(pca.explained_variance_ratio_)
# Print the cumulative contribution rate of the 20 components
print(sum(pca.explained_variance_ratio_))

# print(transformed[0])
# print(len(transformed[0]))

# Run k-means
# Number of clusters
K = 8
cls = KMeans(n_clusters=K)
pred = cls.fit_predict(transformed)

# Plot the points of each cluster in a different color
for i in range(K):
    cluster_points = transformed[pred == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1])

# Draw the cluster centroids (centers of gravity) as hollow circles
centers = cls.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], s=100,
            facecolors='none', edgecolors='black')

# Distance from the query image (the first image, index 0) to each centroid
center_distances = np.linalg.norm(centers - transformed[0], axis=1)

# Cluster whose centroid is closest to the query image
min_center_k = int(np.argmin(center_distances))
min_center_distance = center_distances[min_center_k]

# Cluster whose centroid is farthest from the query image
max_center_k = int(np.argmax(center_distances))
max_center_distance = center_distances[max_center_k]

# Print the file names of the images in the nearest and farthest clusters
print('=========== NEAREST ==============')
for i in range(len(pred)):
    if ( min_center_k == pred[i] ):
        print(face_filepaths[i])
print('=========== FARTHEST ==============')
for i in range(len(pred)):
    if ( max_center_k == pred[i] ):
        print(face_filepaths[i])
print('=========================')

#Display the graph
plt.show()


# Addendum (not required for the clustering above):
# compute the direct distance from the query image to every image
distance = {}
for index in range(len(transformed)):
    distance[face_filepaths[index]] = np.linalg.norm(transformed[0] - transformed[index])

# Print the images sorted by distance, nearest first
print(sorted(distance.items(), key=lambda x: x[1]))

Clustering result graph

The image features are color-coded by cluster, and the centroid of each cluster is drawn as a hollow circle. The plot is somewhat hard to read because only the first two of the 20 principal components are shown, but you can see that points with similar principal components are grouped into the same cluster.
[Figure: scatter plot of the clustering result (image.png)]
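How much of the 20-dimensional structure a two-dimensional plot can show depends on the contribution rate of the first two components. The following sketch prints both figures for comparison. It runs on synthetic data with an assumed low-rank structure, so the numbers say nothing about real face encodings; `svd_solver="full"` is set only for reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Assumption: synthetic 128-d data built from 30 latent factors plus noise
latent = rng.normal(size=(300, 30)) * np.linspace(3.0, 0.5, 30)
mixing = rng.normal(size=(30, 128))
data = latent @ mixing + 0.1 * rng.normal(size=(300, 128))

pca = PCA(n_components=20, svd_solver="full").fit(data)

# Cumulative contribution rate after 1, 2, ..., 20 components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("first 2 components:", cumulative[1])
print("all 20 components:", cumulative[-1])
```

If the first value is much smaller than the second, clusters that look overlapping in the scatter plot may still be well separated in the full 20-dimensional space.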

Analysis results

Query image (the basis for the comparison)

000001.jpg

Images in the same cluster as the query image

000010.jpg 000011.jpg 000019.jpg 000024.jpg

Images in the cluster with the farthest centroid

000012.jpg 000037.jpg 000051.jpg 000060.jpg

Discussion

Many of the images assigned to the same cluster as the query image showed long-haired women, while many of the images in the farthest cluster showed short-haired men, so the clustering appears to match human perception reasonably well. If you need stricter results, compute the norm against every image directly, without principal component analysis. The method described here is worth considering when you want faster computation and are also aiming for some serendipity in the results.
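The trade-off between the exhaustive search and the reduced representation can be checked with a small experiment like the one below. It is only a sketch: the encodings are random stand-ins, `top_neighbors` is a hypothetical helper, and the overlap you see on real face encodings may differ considerably.

```python
import numpy as np
from sklearn.decomposition import PCA

def top_neighbors(vectors, query_index, top_n):
    """Indices of the top_n rows closest to the query row (exhaustive search)."""
    distances = np.linalg.norm(vectors - vectors[query_index], axis=1)
    distances[query_index] = np.inf  # exclude the query itself
    return np.argsort(distances)[:top_n]

rng = np.random.default_rng(0)
# Assumption: random 128-d vectors stand in for real face encodings
encodings = rng.normal(size=(500, 128))
reduced = PCA(n_components=20, svd_solver="full").fit_transform(encodings)

# Strict result: norm computed directly on the original 128-d vectors
exact = top_neighbors(encodings, 0, 10)
# Faster result: norm computed on the 20-d PCA features
approx = top_neighbors(reduced, 0, 10)

overlap = len(set(exact.tolist()) & set(approx.tolist()))
print(f"{overlap} of the 10 exact neighbors were also found after PCA")
```

Running the same comparison on your own encodings gives a rough idea of how much accuracy is traded for speed before clustering is added on top.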
