[PYTHON] Collect only facial images of a specific person by web scraping

I started studying machine learning in April, and I sometimes want face images of a specific person, so I decided to collect them by web scraping. I could probably find an existing dataset if I searched for one, but I wanted to do it with photos of my favorite person's face anyway. Since OpenCV's face detection alone also extracts non-face regions and collects faces of unrelated people, the goal is to add a face recognition step and collect only face images of the specific person.

Before running the program

Notes

Web scraping is a legal gray area, so be careful not to overload the servers when you run this.

Environment / version

Windows 10, Anaconda3, Python 3.5.6, cmake 3.17.1, dlib 19.19.0, face-recognition 1.3.0, opencv-python 4.2.0.34

About the execution results

I will describe the results in detail later, but this program collects face images using face recognition, and in my experiments it collected face images of the target person with almost 100% accuracy. However, if the target has a Japanese (or more broadly, Asian) face, the accuracy drops. When collecting Japanese faces, you need to sort the images manually after execution.

How to use the program

The program code is published on GitHub, so please download it from there. If you would rather skip the explanation and just run it, read the simple procedure on GitHub and execute it.

This time I would like to collect face images of **Ai Shinozaki, the cutest in Japan**.

Install library

This time we will use a library called face_recognition. It requires cmake and dlib, as described in the documentation (https://github.com/ageitgey/face_recognition). All you have to do is install the required libraries.
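After installing cmake, dlib, face_recognition, and opencv-python (plus the other libraries the code below uses, such as beautifulsoup4 and the Google search package), a quick sanity check is simply to import them. The snippet below is only an environment check, not part of the scraping script:

```python
# Sanity check: these imports only succeed if the main libraries are installed.
import cv2
import dlib
import face_recognition

print("OpenCV:", cv2.__version__)
print("dlib:", dlib.__version__)
print("face_recognition imported OK")
```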

Description of main functions

getExternalLinks


from googlesearch import search  # provided by the pip package "google"

def getExternalLinks(page):
    # Return the URLs found on the given results page of a Google search
    # for the query stored in the global variable `name`.
    externalLinks = []
    for url in search(name, lang="jp", start=(page - 1) * 10, stop=10, pause=2.0):
        externalLinks.append(url)
    return externalLinks

This function uses the googlesearch library to acquire and return 10 URLs of sites that appear for the search phrase. This time, the search phrase is "Ai Shinozaki image", and it is stored in the variable name.
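As a usage sketch (the search phrase and page number here are only examples following the article):

```python
# Hypothetical usage of getExternalLinks; `name` is the global search phrase
# the function reads, and page 1 corresponds to start=0 in the Google search.
name = "Ai Shinozaki image"
links = getExternalLinks(1)
for link in links:
    print(link)
```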

DownloadImage

This function actually accesses each of the links obtained by the previous function and downloads the images uploaded there.

import time
from urllib.request import Request, urlopen
from urllib.error import HTTPError

import cv2
from bs4 import BeautifulSoup

def DownloadImage(externalLinks):
    global num
    for url in externalLinks:
        header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
        req = Request(url, headers=header)
        try:
            html = urlopen(req).read()
        except HTTPError:
            # skip pages that fail to load instead of aborting the whole run
            continue
        bs = BeautifulSoup(html, 'html.parser')
        # get the path of every image on the page
        downloadList = bs.find_all('img')
        for download in downloadList:
            try:
                # convert a relative path to an absolute path
                fileUrl = getAbsoluteURL(url, download['src'])
            except:
                continue
            if fileUrl is not None:
                # get all faces in the picture
                face_list = face_detect(fileUrl)
                print(fileUrl)
                if face_list is None:
                    continue
                for face in face_list:
                    # judge whether this face belongs to the target person
                    result = face_recog(face)
                    # count how many of the sample comparisons returned True
                    true_num = 0
                    for i in result:
                        true_num += i * 1
                    # when the count reaches the threshold, write the face image to disk
                    if true_num >= threshold:
                        try:
                            if ".png" in fileUrl:
                                cv2.imwrite(downloadDirectory + str(num) + ".png", face)
                            else:
                                cv2.imwrite(downloadDirectory + str(num) + ".jpg", face)
                            num += 1
                        except:
                            print("Failed to save")
            time.sleep(1)
    return None

First, it gets all the images uploaded on each site and cuts the face regions out of them. Then the face_recog function judges, for each face image, whether it belongs to the target person.
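DownloadImage also relies on two helpers, getAbsoluteURL and face_detect, which are defined in the full script on GitHub. As a rough idea of what they do, here is a minimal sketch, assuming urljoin for resolving relative image paths and OpenCV's Haar cascade for cutting out face regions; the actual implementation in the repository may differ.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

import cv2
import numpy as np

def getAbsoluteURL(pageUrl, src):
    # Resolve a (possibly relative) img src against the URL of the page it came from.
    return urljoin(pageUrl, src)

def face_detect(fileUrl):
    # Download the image and return a list of face crops found by OpenCV,
    # or None if the image cannot be read or contains no faces.
    req = Request(fileUrl, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        data = urlopen(req).read()
    except Exception:
        return None
    image = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
    if image is None:
        return None
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Crops stay in OpenCV's BGR order, as expected by cv2.imwrite later.
    return [image[y:y + h, x:x + w] for (x, y, w, h) in faces]
```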

face_recog

import face_recognition

def face_recog(face):
    # load some images that contain the target person's face
    sample_image = face_recognition.load_image_file("sample_image/<image file>")
    sample_image1 = face_recognition.load_image_file("sample_image/<image file>")
    sample_image2 = face_recognition.load_image_file("sample_image/<image file>")
    sample_image3 = face_recognition.load_image_file("sample_image/<image file>")
    sample_image4 = face_recognition.load_image_file("sample_image/<image file>")

    # encode each sample face as a 128-dimensional feature vector
    sample_image = face_recognition.face_encodings(sample_image)[0]
    sample_image1 = face_recognition.face_encodings(sample_image1)[0]
    sample_image2 = face_recognition.face_encodings(sample_image2)[0]
    sample_image3 = face_recognition.face_encodings(sample_image3)[0]
    sample_image4 = face_recognition.face_encodings(sample_image4)[0]

    try:
        unknown_image = face_recognition.face_encodings(face)[0]
    except:
        # no encoding could be computed for this face crop
        return [False]

    known_faces = [
        sample_image,
        sample_image1,
        sample_image2,
        sample_image3,
        sample_image4
    ]

    # one boolean per sample image: does the unknown face match it?
    results = face_recognition.compare_faces(known_faces, unknown_image)
    return results

For face recognition, prepare in advance some images showing the face of the person you want to collect. This time, to improve accuracy, we load five sample images, compare each of them with the input face image, and return the result of each comparison.

This time, the image is saved if 2 or more of the 5 comparisons return True. (If a single match were enough, unrelated people who happen to match one sample would also be saved; requiring 5 out of 5 would raise precision but sharply reduce the number of images collected.) The code on GitHub also includes a step that deletes exact duplicate images at the end.
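To illustrate the thresholding: face_recog returns one boolean per sample image, and the booleans are simply counted (True sums as 1). A tiny standalone example:

```python
# Suppose three of the five sample comparisons matched this face crop.
result = [True, False, True, True, False]   # what face_recog might return
threshold = 2                               # at least 2 of 5 must match
print(sum(result) >= threshold)             # True -> this face would be saved
```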

Results

In the case of Ai Shinozaki

Here is the result of collecting images from one page of Google search results. [screenshot: キャプチャ.PNG]

As you can see, face images of quite a few unrelated people were saved as well.

For Scarlett Johansson

Next, I experimented with Scarlett Johansson, the sexiest and most beautiful woman overseas. [screenshot: キャプチャ2.PNG]

The results are almost entirely Scarlett Johansson; specifically, 342 out of 343 images were of her. This is considerably more accurate than with Ai Shinozaki. Perhaps the face_recognition library is geared toward Western faces and not well suited to identifying Asian faces.

I would like to try it with various other people in the future.
