[PYTHON] I scraped BOOTH thumbnails and ran face detection with OpenCV, and the accuracy was so good it scared me

BOOTH is known for selling many avatars. As of December 09, 2019, the "3D model" tag has 11,527 items. Of course, this is not the number of avatars as is, because the tag also covers many materials that are unrelated to avatars. The VRC model database published by KingYoSun has about 1,600 models registered, and I think that is the most reasonable figure at the moment.

Is it possible to tell the avatars apart from the thumbnail images alone? It seems doable by detecting faces, but can the faces also be extracted as separate images?

Scraping

So, scraping comes first. Posting code that works as-is with copy and paste seemed like a bad idea, so only the URL is hidden.

import urllib.request as ur
from bs4 import BeautifulSoup
import requests

images = []

def img_save(img_url, title):
    # img_url has already been appended to images, so numbering starts at 1
    file_name = str(len(images)) + ".jpg"
    labeled_name = str(len(images)) + "___" + title + ".jpg"
    response = requests.get(img_url)
    image = response.content
    # Copy named by serial number only (the "data" folder must already exist)
    with open("data/" + file_name, "wb") as o:
        o.write(image)
    # Copy with the title in the name (the "labeled_data" folder must already exist)
    with open("labeled_data/" + labeled_name, "wb") as o:
        o.write(image)

def img_search(url_data):
    url = url_data
    html = ur.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    title = str(soup.title.text)
    char_list = ["/","'",'"',"*","|","<",">","?","\\"," - BOOTH"]
    for c in char_list:
        title = title.replace(c,"")
    print(title)
    for s in soup.find_all("img"):
        if str(s).find("market") > 0:
            img_url = s.get("src")
            if img_url is not None:
                print(img_url)
                images.append(img_url)
                img_save(img_url,title)
                break

def page_access(page_url):
    # page_url is the full URL of one search-result page
    html = ur.urlopen(page_url)
    soup = BeautifulSoup(html, "html.parser")
    for s in soup.find_all("a"):
        if str(s).find("item-card__title-anchor") > 0:
            item_url = s.get("href")
            print(item_url)
            img_search(item_url)

for i in range(1,240):
    url = "***I can't put it***" + str(i)
    page_access(url)

The result obtained in this way is as follows.

image.png

There are about 11,000 images.
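
The title clean-up inside img_search can be tried on its own, without touching the network. The following is only a rough sketch: the helper name sanitize_title and the final strip() are my own additions, while the character list is copied from the script above.

```python
# Hypothetical helper: same character list as char_list in img_search.
FORBIDDEN = ["/", "'", '"', "*", "|", "<", ">", "?", "\\", " - BOOTH"]


def sanitize_title(title: str) -> str:
    """Remove characters that are awkward in file names, plus the site suffix."""
    for token in FORBIDDEN:
        title = title.replace(token, "")
    # strip() is an addition, so names never start or end with a space
    return title.strip()


if __name__ == "__main__":
    print(sanitize_title('Sample "Avatar" / v1.0 - BOOTH'))
```

Keeping the clean-up in one function makes it easy to check a few titles by hand before downloading thousands of files.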

Face detection

Face detection is performed using the OpenCV library.

import cv2

sample = 11000

# Load the cascade once instead of once per image
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
color = (0, 0, 0)

for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Draw a black rectangle around every detected face
        for x, y, w, h in faces:
            cv2.rectangle(img, (int(x), int(y)), (int(x + w), int(y + h)), color, thickness=10)
        output_path = "face_detect/" + str(i+1) + ".jpg"
        cv2.imwrite(output_path, img)

The face detection model has to be downloaded separately and placed locally. In the code above it is haarcascade_frontalface_default.xml. You can download it from the OpenCV GitHub repository.
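
Because the model file is placed by hand, it is worth checking that it is really there before starting a long run. The following is a minimal sketch using only the standard library; the function name find_cascade and the search_dirs parameter are hypothetical and not part of OpenCV.

```python
from pathlib import Path


def find_cascade(name, search_dirs=(".",)):
    """Return the first existing path for `name`, or raise FileNotFoundError.

    `search_dirs` is an assumed list of folders to look in; by default only
    the current directory is checked, as in the article's script.
    """
    for folder in search_dirs:
        candidate = Path(folder) / name
        if candidate.is_file():
            return str(candidate)
    raise FileNotFoundError(f"{name} not found in: {list(search_dirs)}")


if __name__ == "__main__":
    try:
        print(find_cascade("haarcascade_frontalface_default.xml"))
    except FileNotFoundError as err:
        print(err)
```

The returned path can then be passed to cv2.CascadeClassifier. As far as I know, the constructor does not raise an error for a wrong path, and the problem only shows up later when detectMultiScale is called, so failing early is more convenient.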

The result is below.

image.png

The accuracy is not good at all! It misses faces, or on the contrary, mistakes something else for a face.

Anime face model

This is because the face detection model assumes real, photographed faces. When I searched, I found that someone had created a model for anime face detection. Are they a god? So I'll try again.

import cv2

sample = 11000

# This is the only change: the anime face cascade is loaded instead
cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")
color = (0, 0, 0)

for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        for x, y, w, h in faces:
            cv2.rectangle(img, (int(x), int(y)), (int(x + w), int(y + h)), color, thickness=10)
    # Saved for every readable image, whether or not a face was found
    output_path = "face_detect/real" + str(i+1) + ".jpg"
    cv2.imwrite(output_path, img)

Execution result.

image.png

The accuracy is too high!

Next, I crop the faces based on this detection result.

import cv2

sample = 11000
count = 1

# Load the cascade once instead of once per image
classifier = cv2.CascadeClassifier("lbpcascade_animeface.xml")

for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    # Save each detected face as its own image, numbered across all thumbnails
    for x, y, w, h in faces:
        face_image = img[y:y+h, x:x+w]
        output_path = 'face_trim/' + str(count) + '.jpg'
        cv2.imwrite(output_path, face_image)
        count += 1

Execution result.

image.png

...... I'm dizzy because there are too many avatars.
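
The crop above uses the detected rectangle as it is. If a little more of the hair or chin is wanted, one option is to widen the rectangle before slicing. This is only a sketch of that idea, not something the article's script does; the function name expand_box and the 20% default margin are assumptions.

```python
def expand_box(x, y, w, h, img_w, img_h, margin=0.2):
    """Grow (x, y, w, h) by `margin` of its size on each side, clamped to the image.

    Returns a new (x, y, w, h) that never leaves the img_w x img_h area.
    """
    dx = int(w * margin)
    dy = int(h * margin)
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(img_w, x + w + dx)
    y1 = min(img_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0


if __name__ == "__main__":
    # A 100x100 face in the middle of a 620x620 thumbnail (made-up numbers)
    print(expand_box(100, 100, 100, 100, 620, 620))
```

With a NumPy image from cv2.imread, the result could be used as img[y:y+h, x:x+w] in the same way as the loop above, with img.shape[1] and img.shape[0] as the width and height.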

Future outlook

Now that I have a lot of face icons, I tried the method I used the other day, but it only produced ghost-like images. It seems that interesting pictures will not come out unless I use a method such as a GAN. I will study it.

image.png

Approximately 3,000 face images were generated, but since one thumbnail can contain multiple faces and a good number of clothing items (whose thumbnails also show faces) are sold, the actual number of avatars should be smaller. About half of that, roughly the 1,600 mentioned at the beginning, seems to be a reasonable number. I thought it would be interesting to combine this with character recognition (thumbnails contain a lot of sales copy), but I will leave that as a future task.
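
The gap between "number of cropped faces" and "number of thumbnails that contain a face" can be measured if the detections are kept per thumbnail. Below is a small sketch of that bookkeeping. The function name summarize and the sample dictionary are made up for illustration and are not results from the actual run.

```python
def summarize(detections):
    """detections: dict mapping a thumbnail number to a list of (x, y, w, h) boxes.

    Returns (total faces, thumbnails with at least one face,
             thumbnails with more than one face).
    """
    total_faces = sum(len(boxes) for boxes in detections.values())
    with_face = sum(1 for boxes in detections.values() if len(boxes) > 0)
    multi_face = sum(1 for boxes in detections.values() if len(boxes) > 1)
    return total_faces, with_face, multi_face


if __name__ == "__main__":
    # Made-up example: three thumbnails, one of them with three faces
    toy = {
        1: [(10, 10, 120, 120)],
        2: [],
        3: [(0, 0, 100, 100), (200, 0, 100, 100), (400, 0, 110, 110)],
    }
    print(summarize(toy))
```

In the cropping loop, such a dictionary could be filled with the output of detectMultiScale for each thumbnail number. It still would not separate avatars from clothing items, so it only gives an upper bound.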

Also, it would be interesting to create a web service that displays only faces at random, making it easy to find avatars with a face you like among the large number of avatars for sale.
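
The "faces at random" part of such a service could start from something as small as the following sketch. It only picks files from the face_trim folder written by the cropping script. The function name pick_random_faces and the parameters k and seed are hypothetical, and the web part is left out entirely.

```python
import random
from pathlib import Path


def pick_random_faces(folder="face_trim", k=12, seed=None):
    """Return up to k distinct .jpg paths chosen at random from `folder`."""
    paths = sorted(Path(folder).glob("*.jpg"))
    rng = random.Random(seed)  # a fixed seed gives a repeatable selection
    return rng.sample(paths, min(k, len(paths)))


if __name__ == "__main__":
    for path in pick_random_faces():
        print(path)
```

To lead from a face back to its product page, the cropping step would also have to record which thumbnail each face came from, which the current numbering does not keep.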

Articles I referred to

[Explanation for beginners] OpenCV face detection mechanism and practice (detectMultiScale)
Anime face detection with OpenCV
