[PYTHON] Sort anime faces by scraping anime character pages with Beautiful Soup and Selenium

I wanted to use photos of the characters' faces when presenting the results of an Aikatsu! series analysis, but there are a lot of characters and doing it by hand is tedious. So I decided to automate the whole process: scrape the character pages with Beautiful Soup, take screenshots with Selenium, and crop the characters' faces out of those screenshots with OpenCV.

Scraping section

First, the preparation code.

from urllib import request
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
import os
import shutil
import itertools

#OpenCV does not allow Japanese filenames, so load the mapping file
df=pd.read_csv("C:/XXXX/aikatsu_name_romaji_mapping.tsv", sep='\t', engine='python', encoding="utf-8")

#Load Chrome driver
driver = webdriver.Chrome("C:/XXXX/chromedriver/chromedriver.exe") 

#Tuple of URLs to scrape
character_urls =(
    "http://www.aikatsu.net/01/character/index.html",
    "http://www.aikatsu.net/02/character/index.html",
    "http://www.aikatsu.net/03/character/index.html",
    "http://www.aikatsu.net/aikatsustars_01/character/index.html",
    "http://www.aikatsu.net/aikatsustars_02/character/index.html",
    "http://www.aikatsu.net/aikatsufriends_01/character/",
    "http://www.aikatsu.net/aikatsufriends_02/character/",
    "http://www.aikatsu.net/character/"
)

#Create a directory for storing the screenshots
target_dir = "C:/XXXX/download/"

if os.path.isdir(target_dir):
    shutil.rmtree(target_dir)
    time.sleep(1)
os.mkdir(target_dir)

It might have been better to turn the directory-creation part into a function.
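For reference, a minimal sketch of what such a helper might look like (the name recreate_dir is just an illustration, not part of the original script):

import os
import shutil
import time

def recreate_dir(path):
    #Remove the directory if it already exists, then create it fresh
    if os.path.isdir(path):
        shutil.rmtree(path)
        time.sleep(1)
    os.mkdir(path)

recreate_dir("C:/XXXX/download/")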

The mapping is as simple as this.

scraping_aikatsu3.PNG

I use pandas simply because there are only about 67 target characters, so there is no need for a database or anything elaborate, and I can put it together quickly with the knowledge I already have.
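Judging from the lookup df[df["character"]==idol_name]["romaji"] used later, the TSV presumably has at least a character column and a romaji column, tab-separated, along these lines (the rows below are made-up examples, not the actual file):

character	romaji
星宮いちご	hoshimiya_ichigo
大空あかり	oozora_akari
虹野ゆめ	nijino_yume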

Notes on using Selenium

When using Selenium, the driver location is usually added to an environment variable, but since this is a throwaway tool it does not need to be that polished, so I hard-coded the driver path, referring to the following: Introduction to Selenium starting with just 3 lines of Python

In addition, the following error occurred at runtime: WebDriverError: unknown error: Runtime.executionContextCreated has invalid ... This happens when the ChromeDriver version does not match the installed Chrome version, and is fixed by using a driver that matches your version of Chrome.
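If you want to check whether the versions match, the capabilities of the running driver can be inspected, for example like this (a quick sketch using the driver object created above; the exact keys are what recent ChromeDriver versions report and may differ):

#Print the browser version and the driver version reported by the session
print(driver.capabilities.get("browserVersion"))
print(driver.capabilities.get("chrome", {}).get("chromedriverVersion"))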

Scraping implementation

for character_url in character_urls:
    html = request.urlopen(character_url)
    soup = BeautifulSoup(html, "html.parser")
    
    #Get information about each character
    characters=soup.find_all("a")
    idol_names = [i.find('img') for i in characters]
    urls = [i.get('href') for i in characters]
    
    character_url_prefix=character_url.split("index.html")
    
    for i, j in zip(idol_names, urls):
        #Skip entries where the alt attribute could not be extracted
        if i is None:
            continue
        #Skip links that are not character pages
        if j.startswith("http") or j.startswith("../") or j.startswith("index"):
            continue

        idol_name = i.get("alt").replace(" ", "").replace("\u3000", "")  #strip half-width and full-width spaces
        print(idol_name)
        
        #Open the page with Selenium and adjust the window
        driver.get(character_url_prefix[0]+j) 
        driver.set_window_size(1250, 1036)
        driver.execute_script("document.body.style.zoom='90%'")

        #Shirayuri Kaguya's alt attribute is empty, so set the name explicitly
        if idol_name == "":
            idol_name = "Shirayuri Kaguya"
            
        #OpenCV cannot handle Japanese filenames, so convert the name to romaji
        idol_name_romaji = df[df["character"]==idol_name]["romaji"].values[0]

        file_name="{}{}.png ".format(target_dir, idol_name_romaji)

        #If a file with the same name already exists, rename it.
        if os.path.exists(file_name):
            for i in itertools.count(1):
                newname = '{} ({})'.format(idol_name_romaji, i)
                file_name="{}{}.png ".format(target_dir, newname)

                #Exit if the file with the same name does not exist
                if not os.path.exists(file_name):
                    break
                    
        #Sleep a little longer so the page has finished rendering before the screenshot is taken
        time.sleep(5)            
        driver.save_screenshot(file_name)

driver.quit()

You can collect the data like this. Originally I put the Japanese name from the alt attribute into the file name, but since OpenCV cannot read such paths, I deliberately convert it to romaji. (The romaji is rough, so there may be mistakes.)

scraping_aikatsu.PNG
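As an aside, if you did want to keep the Japanese filenames, a common workaround for cv2.imread failing on non-ASCII paths on Windows is to read the bytes with NumPy and let OpenCV decode them. A sketch of that workaround (not part of the script above; the helper name imread_unicode is made up):

import numpy as np
import cv2

def imread_unicode(path):
    #Read the raw bytes first so the path never goes through cv2.imread,
    #then let OpenCV decode the image from the buffer
    buffer = np.fromfile(path, dtype=np.uint8)
    return cv2.imdecode(buffer, cv2.IMREAD_COLOR)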

Also, when collecting the character names (idol_names), some entries come back as None, as shown below. Because each character's URL and name are iterated together with zip, the two lists need the same number of elements, so I filter the None entries out inside the loop rather than before it (an alternative that filters before the loop is sketched after this output).

[<img alt="Aikatsu on Parade!" src="../images/logo.png "/>,
 <img alt="Aikatsu on Parade! communication" src="../images/bt-aikatsuonparadecom.png "/>,
 <img alt="Aikatsu on Parade! What is" src="../images/bt-aikatsuonparade.png "/>,
 <img alt="Broadcast information" src="../images/bt-tvinfo.png "/>,
 <img alt="character" src="../images/bt-character.png "/>,
 <img alt="Story" src="../images/bt-story.png "/>,
 <img alt="CD" src="../images/bt-cd.png "/>,
 <img alt="BD/DVD" src="../images/bt-bddvd.png "/>,
 <img alt="NEWS" src="../images/bt-news.png "/>,
 <img alt="TOP" src="../images/bt-top.png "/>,
 <img alt="Raki Kiseki" src="images/bt-raki.png "/>,
 <img alt="Yuki Aine" src="images/bt-aine.png "/>,
 <img alt="Mio Minato" src="images/bt-mio.png "/>,
 <img alt="Hoshimiya Ichigo" src="images/bt-ichigo.png "/>,
 <img alt="Akari Ozora" src="images/bt-akari.png "/>,
 <img alt="Yume Nijino" src="images/bt-yume.png "/>,
 <img alt="BANDAINAMCO Pictures" height="53" src="../images/bnp.png " width="118"/>,
 None]
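For comparison, an equivalent way to drop those entries before the loop would be to filter the (name, URL) pairs themselves, e.g. with a list comprehension (a sketch of an alternative, assuming the same idol_names and urls lists; it is not what the script above does):

#Keep only pairs that have an img tag and a link that looks like a character page
pairs = [
    (img, href) for img, href in zip(idol_names, urls)
    if img is not None
    and not (href.startswith("http") or href.startswith("../") or href.startswith("index"))
]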

OpenCV section

import os
import shutil
import time
import cv2
from pathlib import Path

#Create a directory for the cropped faces
download_dir = '{0}parse/'.format(target_dir)

if os.path.isdir(download_dir):
    shutil.rmtree(download_dir)
    time.sleep(1)
os.mkdir(download_dir)

#Create a classifier based on the feature file
classifier = cv2.CascadeClassifier('C:/XXX/lbpcascade_animeface.xml')

#Get the files in the scraped directory
p = Path(target_dir)

for image_path in p.glob("*.png"):
    #Load the screenshot
    image = cv2.imread(image_path.as_posix())

    #Create a per-character directory named after the file (without the extension)
    parse_dir = '{0}{1}/'.format(download_dir, image_path.stem)
    os.mkdir(parse_dir)

    #Convert to grayscale and detect faces
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(gray_image)

    for i, (x, y, w, h) in enumerate(faces):
        #Cut out each face. Adjust the y coordinate to make the crop rectangular
        face_image = image[y-50:y+h, x:x+w]
        output_path = '{0}{1}.png'.format(parse_dir, i)
        #Write the cropped face
        cv2.imwrite(output_path, face_image)

I wanted the cropped faces to be rectangular, so I just tweaked the y coordinate a little; other than that, the code follows this article: Anime face detection with OpenCV

scraping_aikatsu2.PNG

You get results like this. Because of their poses, some characters were not detected by the classifier. Since only a few are affected, I will probably have to handle those manually.
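To spot the screenshots the classifier missed without opening every crop, one option is to draw the detected boxes back onto a copy of each screenshot and flag the files with no detections. A sketch, assuming the same classifier, target_dir and download_dir as above:

#Draw the detected face rectangles onto a copy of each screenshot for manual review
for image_path in Path(target_dir).glob("*.png"):
    image = cv2.imread(image_path.as_posix())
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(gray)
    if len(faces) == 0:
        print("No face detected:", image_path.name)
    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imwrite('{0}checked_{1}'.format(download_dir, image_path.name), image)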
