BOOTH is well known as a marketplace where many avatars are sold. As of December 9, 2019, there are 11,527 items under the "3D model" tag. Of course, this is not the number of avatars as such, because the tag also contains many materials that are unrelated to avatars. The VRC model database published by KingYoSun has about 1,600 models registered, and I think that is the most reasonable estimate at the moment.

Can avatars be told apart from everything else by their thumbnail images? Recognizing faces seems like one way to do it, but is it also possible to extract just the face as a separate image?

Scraping

So, first comes scraping. Posting code that works with a simple copy and paste would be a bit much, so only the URL is hidden.

import urllib.request as ur
from bs4 import BeautifulSoup
import requests

images = []

def img_save(img_url, title):
    # images already contains this URL, so the numbering starts at 1
    file_name = str(len(images)) + ".jpg"
    labeled_name = str(len(images)) + "___" + title + ".jpg"
    response = requests.get(img_url)
    image = response.content
    # This copy is named with just the serial number
    with open("data/" + file_name, "wb") as o:
        o.write(image)
    # This copy also has the title in its name
    with open("labeled_data/" + labeled_name, "wb") as o:
        o.write(image)

def img_search(url_data):
    url = url_data
    html = ur.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    title = str(soup.title.text)
    char_list = ["/","'",'"',"*","|","<",">","?","\\"," - BOOTH"]
    for c in char_list:
        title = title.replace(c,"")
    print(title)
    for s in soup.find_all("img"):
        if str(s).find("market") > 0:
            img_url = s.get("src")
            if img_url is not None:
                print(img_url)
                images.append(img_url)
                img_save(img_url,title)
                break

def page_access(page_number):
    url = page_number
    html = ur.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    for s in soup.find_all("a"):
        # Follow only the links whose tag contains item-card__title-anchor
        if "item-card__title-anchor" in str(s):
            url = s.get("href")
            print(url)
            img_search(url)

for i in range(1,240):
    url = "***I can't put it***" + str(i)
    page_access(url)

The result obtained in this way is as follows.

There are about 11,000 images.
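The title clean-up that img_search performs before building a file name can also be written as a small standalone helper. The following is only a sketch: the name sanitize_title is hypothetical, the character set is the same as char_list in the script above, and the final strip() is an extra step that the original script does not have.

```python
import re

# Same characters that char_list removes in the scraping script.
_UNSAFE_CHARS = re.compile(r"""[/'"*|<>?\\]""")
_SITE_SUFFIX = " - BOOTH"


def sanitize_title(title: str) -> str:
    """Return the page title with the site suffix and unsafe characters removed."""
    title = title.replace(_SITE_SUFFIX, "")
    # strip() is an addition here: it drops leading and trailing whitespace.
    return _UNSAFE_CHARS.sub("", title).strip()
```

With a helper like this, img_search could call sanitize_title(soup.title.text) instead of looping over char_list.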

Face detection

Face detection is performed using the OpenCV library.

import cv2

sample = 11000

# Load the cascade once instead of on every iteration
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
color = (0, 0, 0)

for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Draw a rectangle around every detected face
        for x, y, w, h in faces:
            cv2.rectangle(img, (int(x), int(y)), (int(x + w), int(y + h)), color, thickness=10)
        output_path = "face_detect/" + str(i+1) + ".jpg"
        cv2.imwrite(output_path, img)

The face detection model has to be downloaded separately and placed locally. In the code above, it is haarcascade_frontalface_default.xml. You can download it from the OpenCV GitHub repository.
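Because the cascade file is a separate download, it is easy to run the script with the file missing or in the wrong folder. As far as I know, cv2.CascadeClassifier does not raise an error in that case and simply returns an empty classifier, so the failure only shows up later at detectMultiScale. A minimal sketch of an explicit check follows; the helper name load_cascade is hypothetical and is not part of the original script.

```python
import os


def load_cascade(xml_path: str):
    """Load a cascade file, failing early if it is missing or invalid."""
    if not os.path.isfile(xml_path):
        raise FileNotFoundError(f"cascade file not found: {xml_path}")

    import cv2  # imported here so the path check runs before OpenCV is needed

    cascade = cv2.CascadeClassifier(xml_path)
    # empty() is True when OpenCV could not load a cascade from the file.
    if cascade.empty():
        raise ValueError(f"not a valid cascade file: {xml_path}")
    return cascade
```

It would be used as cascade = load_cascade("haarcascade_frontalface_default.xml") before the loop.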

The result is below.

The accuracy is not good at all! It misses faces or, conversely, mistakes something else for a face.

Anime face model

This is because the face detection model assumes photographs of real faces. When I searched, I found that someone had created a model for anime face detection. Are they a god? So I'll try again.

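import cv2

sample = 11000

# Load the cascade once instead of on every iteration
cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")  # This is the line that changed
color = (0, 0, 0)

for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Draw a rectangle around every detected face
        for x, y, w, h in faces:
            cv2.rectangle(img, (int(x), int(y)), (int(x + w), int(y + h)), color, thickness=10)
    # As in the original, every readable image is written, with or without detections
    output_path = "face_detect/real" + str(i+1) + ".jpg"
    cv2.imwrite(output_path, img)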

Execution result.

The accuracy is too high!

Next, crop the faces out based on these detection results.

import cv2

sample = 11000
count = 1

# Load the cascade once instead of on every iteration
classifier = cv2.CascadeClassifier("lbpcascade_animeface.xml")

for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Save each detected face as its own numbered image
        for x, y, w, h in faces:
            face_image = img[y:y+h, x:x+w]
            output_path = 'face_trim/' + str(count) + '.jpg'
            cv2.imwrite(output_path, face_image)
            count += 1

Execution result.

...There are so many avatars that it makes me dizzy.
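The cropping code cuts out exactly the rectangle that detectMultiScale returns. If a slightly wider crop is wanted (for example, to keep some hair around the face), the box can be enlarged before slicing. This is only a sketch of that optional variation, not something the original script does; the name padded_box and the margin parameter are hypothetical.

```python
def padded_box(x, y, w, h, img_w, img_h, margin=0.2):
    """Expand a face box by `margin` of its size on every side, clamped to the image."""
    pad_x = int(w * margin)
    pad_y = int(h * margin)
    left = max(x - pad_x, 0)
    top = max(y - pad_y, 0)
    right = min(x + w + pad_x, img_w)
    bottom = min(y + h + pad_y, img_h)
    return left, top, right, bottom
```

Inside the loop it could be used as left, top, right, bottom = padded_box(x, y, w, h, img.shape[1], img.shape[0]), followed by face_image = img[top:bottom, left:right].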

Future outlook

I now have a lot of face icons, but when I used the method I tried the other day I could only produce ghost-like images, so it seems that an interesting picture will not come out unless I use a method such as a GAN. I will study it.

About 3,000 images were generated, but since a single thumbnail can contain multiple faces, and a good number of outfits made for specific avatars are sold (whose thumbnails also show faces), the actual number of avatars should be lower. About half of that, around the 1,600 mentioned at the beginning, seems like a reasonable figure. I thought it would be interesting to combine this with character recognition (thumbnails contain a lot of sales copy), but I will leave that as a future task.
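The gap between "faces cropped" and "thumbnails that contain a face" can be checked by recording len(faces) for each thumbnail during detection. The sketch below assumes such a list has been collected; the function name and the sample counts are made up for illustration and are not results from the actual run.

```python
def summarize_detections(faces_per_thumbnail):
    """Return (total faces, number of thumbnails with at least one face)."""
    total_faces = sum(faces_per_thumbnail)
    thumbnails_with_face = sum(1 for n in faces_per_thumbnail if n > 0)
    return total_faces, thumbnails_with_face


# Hypothetical counts for five thumbnails, not real data
total, with_face = summarize_detections([0, 2, 1, 0, 3])
print(total, with_face)  # prints: 6 3
```

The second number is a tighter upper bound on the number of avatars than the number of cropped faces, although it still includes outfit thumbnails.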

Also, it would be interesting to create a web service that displays only faces at random and makes it easy to search for avatars with your favorite faces from a large number of avatars for sale.

Articles that helped me

[Explanation for beginners] OpenCV face detection mechanism and practice (detectMultiScale)
Anime face detection with OpenCV

[PYTHON] When I scraped the thumbnail of BOOTH and detected the face with OpenCV, the accuracy was too good and I was scared.
