[PYTHON] First Deep Learning ~ Preparation ~

Nice to meet you, I'm @best_not_best. The other day I gave a talk about Chainer at an in-house study session, and it was surprisingly well received, so I would like to write up the details in this article.

What I want to do

You all have one or two favorite celebrities, right? (I'll proceed on the assumption that you do.) But you are unlikely ever to meet that person face to face. What if there were someone close to you who looks like them ... and what if you could get to know that person ...

Caution

Below, I scrape our internal company site, but this article does not recommend doing so. Please read it purely as a story. I take no responsibility for any damage caused by actually carrying out what is described here. Follow your own company's information security rules and enjoy your work.

Environment

Procedure

  1. Collect images of employees
  2. Cut out the face from each collected employee image
  3. Collect training images (favorite celebrities)
  4. Cut out the face from each training image
  5. Train a classifier on the faces from step 4 with Python + Chainer
  6. Have the classifier classify the faces from step 2

Practice

1. Collect images of employees

Your company's intranet site probably has an employee search function. Search for an arbitrary employee there and look up the URL of that employee's image. With luck, the employee ID is included in the URL, as in http://hogehoge.co.jp/image/12345.jpg. Depending on the company, the ID may be hashed with MD5 or similar. Either way, work out the relationship between the employee ID and the image URL. (If you can't find one, give up ...)
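As a rough sketch of the two cases above, the helper below builds an image URL either from the plain employee ID or from its MD5 hex digest. The function name `image_url` and the `hashed` flag are made up for illustration, and the idea that a hashed site uses the hex MD5 of the ID string as the file name is only an assumption; a real site may hash something else entirely.

```python
import hashlib

# Placeholder URL pattern taken from the article's example
URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'


def image_url(employee_id, hashed=False):
    """Build the image URL for one employee ID (hypothetical helper)."""
    key = str(employee_id)
    if hashed:
        # Assumption: the file name is the hex MD5 digest of the ID string
        key = hashlib.md5(key.encode('utf-8')).hexdigest()
    return URL_FMT % key


plain_url = image_url(12345)
hashed_url = image_url(12345, hashed=True)
```

Comparing `hashed_url` with the URL actually shown on the search page is a quick way to check whether the guess about the hashing scheme holds.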

Next, look for a list of employee IDs. If you press the search button without entering anything in the search form, a list may appear. Scrape that list page.

abstraction_id.py


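#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import lxml.html
from selenium import webdriver

TARGET_URL = 'http://hogehoge.co.jp/list.html'

# Render the list page in a headless browser and parse the resulting HTML
driver = webdriver.PhantomJS()
try:
    driver.get(TARGET_URL)
    root = lxml.html.fromstring(driver.page_source)
finally:
    driver.quit()

# Replace 'p.class' with the selector of the element that holds the employee ID
links = root.cssselect('p.class')
for link in links:
    # Skip elements without text and keep only purely numeric IDs
    if link.text is None:
        continue
    if link.text.isdigit():
        print(link.text)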

Execute it with the following command.

$ python abstraction_id.py > member_id.txt

TARGET_URL = 'http://hogehoge.co.jp/list.html' can also point to a local file path, so you can save the page first and scrape the saved copy. Pass root.cssselect() the selector of the HTML element that contains the employee ID. In this case the selector matched elements in several parts of the HTML, so the script filters them with a condition. Here an element is treated as an employee ID when its text consists only of digits; replace this check with a regular expression as appropriate.
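If the IDs are not purely numeric, the isdigit() check can be swapped for a regular expression along the following lines. The ID format used here (one uppercase letter followed by five digits) and the names `ID_PATTERN` and `extract_ids` are hypothetical; adjust the pattern to whatever your IDs actually look like.

```python
import re

# Hypothetical ID format: one uppercase letter followed by exactly five digits
ID_PATTERN = re.compile(r'[A-Z][0-9]{5}\Z')


def extract_ids(texts, pattern=ID_PATTERN):
    """Return the stripped texts that match the ID pattern from start to end."""
    ids = []
    for text in texts:
        if text is None:
            # Elements without text are skipped, as in the script above
            continue
        candidate = text.strip()
        if pattern.match(candidate):
            ids.append(candidate)
    return ids


sample_ids = extract_ids(['A12345', None, ' B00001 ', 'Name', '12345', 'A123456'])
```

In abstraction_id.py this would be fed with the text of each element returned by root.cssselect().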

Next, download the images to your local machine using the ID list you obtained.

image_crawler.py


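#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from urllib2 import Request, urlopen, URLError
import os
import time

ID_LIST = './member_id.txt'
URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'
OUTPUT_FMT = './photos/%s.jpg'

# Create the output directory if it does not exist yet
output_dir = os.path.dirname(OUTPUT_FMT)
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

with open(ID_LIST, 'r') as id_file:
    for line in id_file:
        member_id = line.strip()
        if not member_id:
            continue
        url = URL_FMT % member_id
        output = OUTPUT_FMT % member_id

        try:
            response = urlopen(Request(url))
        except URLError as e:
            # Report the failure and move on to the next ID
            err = e.reason if hasattr(e, 'reason') else getattr(e, 'code', None)
            print('failed: %s (%s)' % (url, err))
        else:
            # Save the response that was already fetched (no second request)
            with open(output, 'wb') as image_file:
                image_file.write(response.read())

        # Pause between requests so as not to load the server
        time.sleep(0.1)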

Execute it with the following command.

$ python image_crawler.py

Just in case, the script calls `time.sleep()` between requests so as not to put load on the server. `OUTPUT_FMT` determines the directory where the images are saved, so change it as appropriate.
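The pause-between-requests idea can be separated from the download itself, roughly as sketched below. `crawl_politely` and its parameters are names invented for this illustration, and the fetch function is passed in so that the loop can be tried without touching any real server.

```python
import time


def crawl_politely(urls, fetch, delay_sec=0.1, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_sec between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one
            sleep(delay_sec)
        try:
            results[url] = fetch(url)
        except IOError:
            # urllib's URLError is an IOError subclass; record the failure as None
            results[url] = None
    return results


def fake_fetch(url):
    """Stand-in for a real download, used only for this demonstration."""
    if url.endswith('missing.jpg'):
        raise IOError('not found')
    return 'data for ' + url


pauses = []
demo = crawl_politely(['a.jpg', 'missing.jpg', 'b.jpg'], fake_fetch,
                      delay_sec=0.5, sleep=pauses.append)
```

Here `pauses.append` replaces `time.sleep` so that the demonstration records the delays instead of actually waiting.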

2. Cut out the face part of the collected employee images

I cut the faces out using OpenCV, referring to the following article: "Py-opencv Cut out a part of the image and save it" (Symfoware).

cutout_face.py


#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import os
import cv2

CASCADE_PATH = '/usr/local/opt/opencv/share/OpenCV/haarcascades/haarcascade_frontalface_alt.xml'
INPUT_DIR_PATH = './photos/'
OUTPUT_DIR_PATH = './cutout/'
OUTPUT_FILE_FMT = '%s%s_%d%s'

# Load the cascade classifier once, outside the loop
cascade = cv2.CascadeClassifier(CASCADE_PATH)

for file_name in os.listdir(INPUT_DIR_PATH):
    input_image_path = INPUT_DIR_PATH + file_name

    # Read the file; imread returns None if it is not a readable image
    image = cv2.imread(input_image_path)
    if image is None:
        continue

    # Convert to grayscale
    image_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Run object detection (face detection)
    facerect = cascade.detectMultiScale(image_gray, scaleFactor=1.1, minNeighbors=1, minSize=(1, 1))

    # Save each detected face as <original name>_<index><extension>
    path, ext = os.path.splitext(os.path.basename(file_name))
    for i, rect in enumerate(facerect, 1):
        print(rect)
        x, y, w, h = rect

        output_image_path = OUTPUT_FILE_FMT % (OUTPUT_DIR_PATH, path, i, ext)
        cv2.imwrite(output_image_path, image[y:y+h, x:x+w])

Execute it with the following command.

$ python cutout_face.py

`INPUT_DIR_PATH` is the directory where the images were saved in the previous section, and `OUTPUT_DIR_PATH` is the directory where the cut-out faces are saved; change them as appropriate. If you get `ImportError: No module named cv2`, changing

import cv2

to

import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
import cv2

should avoid the error.
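The same workaround can be written so that the extra path is only added when the normal import fails. This is just a sketch: `import_with_fallback` is a made-up helper name, and the site-packages path in the usage note is the one from this article and will differ on other machines.

```python
import importlib
import sys


def import_with_fallback(module_name, extra_path):
    """Import a module, retrying once with extra_path appended to sys.path."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # The module was not found on the default path; add the fallback and retry
        if extra_path not in sys.path:
            sys.path.append(extra_path)
        return importlib.import_module(module_name)


# Demonstration with a standard-library module, which needs no fallback
json_module = import_with_fallback('json', '/path/that/is/not/needed')
```

In cutout_face.py the call would look like `cv2 = import_with_fallback('cv2', '/usr/local/lib/python2.7/site-packages')`. If the module is missing from the fallback path as well, the second import raises ImportError as usual.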

The face is cut out correctly from most images, but in some cases a necktie is detected as a face, as in the image below. This is an issue for the future.

(Image: 110001_2.jpg)
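One possible mitigation, which is not something this article does, is to keep only the largest detected rectangle per photo. This assumes that each employee photo contains exactly one face and that a false detection such as a necktie is smaller than the real face; neither assumption is guaranteed. `largest_rect` is a hypothetical helper, and the sample rectangles are invented values.

```python
def largest_rect(rects):
    """Return the (x, y, w, h) rectangle with the largest area, or None if empty."""
    best = None
    best_area = -1
    for rect in rects:
        x, y, w, h = rect
        area = w * h
        if area > best_area:
            best, best_area = (x, y, w, h), area
    return best


# Invented example: a large face rectangle and a small false detection below it
detections = [(40, 30, 120, 120), (80, 170, 35, 35)]
face = largest_rect(detections)
```

In cutout_face.py the result of cascade.detectMultiScale() would be passed in place of `detections`. Raising the minNeighbors or minSize arguments of detectMultiScale() is another knob worth trying.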

Next time

That's all for this time. (Sorry for stopping halfway ...) The story continues in the day-21 article of Intelligence Advent Calendar 2015!

Postscript

Solved! → First Deep Learning ~ Solution ~ (Qiita)
