[Scraping] I tried scraping fashion images and product text in Python

Introduction

This is my first post on Qiita, so please be gentle.

I am currently in the first year of my master's program, and I am thinking of doing research on the theme of deep learning and fashion.

So, as a first step and as a way of studying, I decided to try image classification, and I have been collecting the data myself. (Just classifying an existing dataset would be, well ...)

Execution environment

I scraped each product image together with its product description as a set.

This time I used Python's lxml library to scrape the images and the text. I was a beginner not only at scraping but at programming itself, so I referred to "Python Crawling & Scraping" published by Gijutsu-Hyoronsha. I wrote the code myself. If I had known about Qiita back then, I could probably have solved everything with Qiita ... The code is below.

Supplement

This time I wanted to scrape only men's T-shirts, so I specified the category and gender. On the ZOZOTOWN EC site, each list page shows 135 products, and clicking a product takes you to the detail page for that product. The code here scrapes the text and the top image from each of those detail pages.

scraping_zozo_img_text.py


from typing import Iterator
from typing import List
import requests
import lxml.html
import time
import csv
import os

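The following is the main execution. In the actual file, it comes after the function definitions described below, because the functions must be defined before they are called.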

csvlist = [['no', 'URL', 'item_text']]

i = 0
u = 0
j = 0
URL = "https://zozo.jp/men-category/tops/tshirt-cutsew/?pno="
for page in range(1, 100):
    time.sleep(1)
    pageUrl = "https://zozo.jp/men-category/tops/tshirt-cutsew/?pno=" + str(page)
    response = requests.get(pageUrl)

    # Get the URL of the detail page of each item on the list page ↓ ↓ ↓ ↓
    urls = scrape_item_page(response)  # detail-page URL of each item (each image on the list page)


    for url in urls:
        j = j + 1
        time.sleep(1)

        #Pick up images and save them in a folder ↓↓↓↓↓↓↓
        img_url = get_image(url)
        w_img = requests.get(img_url)
        with open(os.path.join('picture_zozo', str(j) + '.jpg'), 'wb') as file:
            file.write(w_img.content)

        info = scrape_item_infomation(url)
        print(info)

        csvlist.append([j, url, info])


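# newline='' is what the csv module expects; UTF-8 keeps the Japanese text intact.
with open("item_text.csv", 'w', newline='', encoding='utf-8') as f:
    writecsv = csv.writer(f)
    writecsv.writerows(csvlist)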


Think of the response returned by requests.get(pageUrl) in the main loop as an object that holds all the information of the page at that URL.

Details of the defined functions

Function to get the URL for each item on each page


def scrape_item_page(response: requests.Response) -> List[str]:
    html = lxml.html.fromstring(response.text)
    html.make_links_absolute(response.url)

    url=[]
    for a in html.cssselect('#searchResultList > li > div[class="catalog-item-container"] > a'):
        url.append(a.get('href'))

    return url

This function is called from the main loop with the response of each list page. response.text is the full HTML source. Passing it to the fromstring function gives you an HtmlElement directly. make_links_absolute rewrites relative links as absolute links. The cssselect call follows the HTML tags down to the a tags that hold the URL of each product detail page, and a.get('href') reads the URL from the href attribute of each tag. (This gets 135 URLs per page.)

Function to get the text on the detail page of each item

#Define a function to access the URL of each item and get the product introduction

def scrape_item_infomation(url):
    response = requests.get(url)
    response.encoding = response.apparent_encoding
    html = lxml.html.fromstring(response.text)
    infomation = html.cssselect('#tabItemInfo > div[class="innerBox"] > div[class="contbox"]')
    info = infomation[0].text_content()

    return info

The main loop takes the 135 URLs one at a time and gets the image and the text for each. Setting response.encoding to response.apparent_encoding prevents garbled characters. The rest is the same as before: the text is assigned to info and returned.
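Why the encoding line matters can be seen without any network access. The sketch below is not the code from this article. It only assumes that the page bytes are UTF-8 and that they get decoded as ISO-8859-1, which, as far as I understand, is the fallback requests uses for text responses whose headers do not declare a charset. The sample string is made up.

```python
# Hypothetical product text, encoded the way a UTF-8 page would send it.
raw_bytes = "半袖Tシャツ".encode("utf-8")

# Decoding with the wrong codec yields garbled characters (mojibake).
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the encoding that matches the bytes restores the text.
restored = raw_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```

response.apparent_encoding plays the role of picking the matching codec here: it guesses the encoding from the bytes themselves instead of trusting the headers.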

Function to get the image information (image URL) of each item

#Define a function that retrieves image information

def get_image(url):  # URL of the item's detail page
    response = requests.get(url)
    html = lxml.html.fromstring(response.text)
    html.make_links_absolute(response.url)
    image = html.cssselect('#photoMain > img')
    for img in image:
        img_url = img.get('src')
        print(img_url)

    return img_url

The flow is almost the same as for getting the text. Here the function returns the URL of the image.

Back in the main loop, the image is then saved to a folder. The folder must be in the same directory as the script. Note that you will get an error unless you create the empty folder in advance.

In my case, I created a folder called 'picture_zozo'.
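If you would rather not create the folder by hand, the standard library can create it from the script. This is a minimal sketch, assuming the same folder name. The helper name image_path is hypothetical, and the demo runs inside a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile


def image_path(folder: str, number: int) -> str:
    """Create the folder if needed and return the path for item `number`."""
    # exist_ok=True makes this a no-op when the folder already exists.
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, str(number) + '.jpg')


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'picture_zozo')
    path = image_path(folder, 1)
    with open(path, 'wb') as file:
        file.write(b'dummy image bytes')  # stands in for w_img.content
    saved = os.listdir(folder)

print(saved)
```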

Finally, the item number, the item URL, and the product text are saved to a CSV file. You can use it to check that the saved image and the image on the page behind the URL are the same product.

The images are saved like this. (Screenshot: スクリーンショット 2020-05-22 0.23.19.png)

The CSV file looks like this.

no    URL         item_text
1     https://〜   〇〇
2     https://〜   △△
3     https://〜   □□

The item number in the CSV matches the number in the image file name (○.jpg).

Since the same URL is used for both inside the inner for loop of the main execution, each text and image pair belongs to the same product.
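One way to check this pairing afterwards is to read the CSV back and look for the image file with the same number. This is a sketch under assumptions: the helper name find_missing_images is hypothetical, the CSV is assumed to be UTF-8, and the demo builds a tiny CSV and folder in a temporary directory instead of using real scraped data.

```python
import csv
import os
import tempfile
from typing import List


def find_missing_images(csv_path: str, folder: str) -> List[str]:
    """Return the item numbers in the CSV that have no <no>.jpg in folder."""
    missing = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            if not os.path.exists(os.path.join(folder, row['no'] + '.jpg')):
                missing.append(row['no'])
    return missing


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'picture_zozo')
    os.makedirs(folder)
    csv_path = os.path.join(tmp, 'item_text.csv')
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([
            ['no', 'URL', 'item_text'],
            [1, 'https://example.com/item/1', 'text 1'],
            [2, 'https://example.com/item/2', 'text 2'],
        ])
    # Only item 1 has an image, so item 2 should be reported.
    with open(os.path.join(folder, '1.jpg'), 'wb') as img:
        img.write(b'dummy')
    missing = find_missing_images(csv_path, folder)

print(missing)
```

An empty result means every row in the CSV has a matching image file.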

Summary

This was my first post on Qiita, and I found it quite difficult to explain things in writing. I have not mastered the topic well enough to explain it perfectly, so the explanation may be hard to follow; please forgive me. If it is hard to picture what is happening, I recommend running the code yourself and checking the values with print().
