This is my first post on Qiita, so please be gentle.
I am currently in the first year of my master's degree, and I am thinking of doing research on the theme of deep learning and fashion.
As a first step, and as a way of studying, I decided to start with image classification, and I have been collecting the data myself. (Classifying with an existing dataset felt, well ...)
This time I used Python's lxml library to scrape images and text. I was a beginner not only at scraping but at programming itself, so I referred to "Python Crawling & Scraping" published by Gijutsu-Hyoronsha and wrote the code myself. If I had known about Qiita back then, Qiita alone would probably have been enough ... The code is below.
This time I wanted to scrape only men's T-shirts, so I specified the category and gender. The ZOZOTOWN EC site shows 135 products on each list page, and clicking a product takes you to the details page for that product. The code below scrapes the text and the top image from each of those details pages.
scraping_zozo_img_text.py
from typing import Iterator
from typing import List
import requests
import lxml.html
import time
import csv
import os
The following is the main execution.
csvlist = [['no', 'URL', 'item_text']]
j = 0  # running item number; also used as the image filename
URL = "https://zozo.jp/men-category/tops/tshirt-cutsew/?pno="
for page in range(1, 100):
    time.sleep(1)  # wait between requests so the server is not overloaded
    pageUrl = URL + str(page)
    response = requests.get(pageUrl)
    # Get the URL of each item's detail page from the list page
    urls = scrape_item_page(response)
    for url in urls:
        j = j + 1
        time.sleep(1)
        # Fetch the main image and save it in the folder
        img_url = get_image(url)
        w_img = requests.get(img_url)
        with open('picture_zozo/' + str(j) + '.jpg', 'wb') as file:
            file.write(w_img.content)
        info = scrape_item_infomation(url)
        print(info)
        csvlist.append([j, url, info])
# newline='' lets the csv module control line endings itself
with open("item_text.csv", 'w', newline='', encoding='utf-8') as f:
    writecsv = csv.writer(f)
    writecsv.writerows(csvlist)
Think of the response returned by requests.get(pageUrl) as an object that holds all the information from the page at that URL.
def scrape_item_page(response: requests.Response) -> List[str]:
    html = lxml.html.fromstring(response.text)
    html.make_links_absolute(response.url)
    urls = []
    for a in html.cssselect('#searchResultList > li > div[class="catalog-item-container"] > a'):
        urls.append(a.get('href'))
    return urls
This function is called from the main loop as scrape_item_page(response). response.text is the full HTML source of the page, and passing it to the fromstring function gives you an HtmlElement directly. make_links_absolute then rewrites relative links into absolute links. The cssselect call follows the HTML tags down to the a elements that link to each product detail page, and a.get('href') returns the URL stored in the **href** attribute of each one. (This yields 135 URLs per list page.)
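To make the relative-to-absolute rewriting concrete, here is a minimal sketch using the standard library's `urllib.parse.urljoin`. It resolves one link at a time against a base URL, which is conceptually similar to what make_links_absolute does for every link in the parsed document. The href values are invented for illustration and are not taken from the real ZOZOTOWN markup.

```python
from urllib.parse import urljoin

# URL of a list page (same pattern as pageUrl in the main code).
base_url = "https://zozo.jp/men-category/tops/tshirt-cutsew/?pno=1"

# Hypothetical href values, made up for this illustration.
hrefs = [
    "/shop/example/goods/12345/",           # root-relative link
    "https://zozo.jp/shop/other/goods/1/",  # already absolute
]

# Resolve each href against the page it was found on.
absolute_urls = [urljoin(base_url, href) for href in hrefs]
for absolute_url in absolute_urls:
    print(absolute_url)
```

The root-relative link picks up the scheme and host of the list page, while the link that was already absolute is left unchanged.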
# Access an item's detail page and return its product description
def scrape_item_infomation(url):
    response = requests.get(url)
    response.encoding = response.apparent_encoding  # guess the encoding from the content
    html = lxml.html.fromstring(response.text)
    infomation = html.cssselect('#tabItemInfo > div[class="innerBox"] > div[class="contbox"]')
    info = infomation[0].text_content()
    return info
The for url in urls loop in the main code takes the 135 URLs one at a time and fetches the image and text for each. Setting response.encoding to response.apparent_encoding prevents garbled characters. The rest is the same as before: the function finally assigns the text to **info** and returns it.
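Why the encoding line matters can be shown without any network access. The sketch below assumes a page whose bytes are UTF-8 but which gets decoded as ISO-8859-1, the fallback that requests applies to text responses that declare no charset. The sample string is invented, and the actual encoding of the ZOZOTOWN pages is not something this example claims to know.

```python
# Sample text, invented for illustration.
original = "メンズ Tシャツ"
raw = original.encode("utf-8")      # bytes as they might arrive from a server

garbled = raw.decode("iso-8859-1")  # wrong codec: every byte becomes one character
correct = raw.decode("utf-8")       # right codec: the text is recovered

print(garbled == original)  # the wrongly decoded text no longer matches
print(correct == original)  # the correctly decoded text does
print(len(raw), len(garbled), len(correct))
```

Decoding with the wrong codec does not raise an error, it silently produces different characters, which is why the garbling is easy to miss until you look at the scraped text.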
# Return the URL of the main image on an item's detail page
def get_image(url):  # URL of the item's detail page
    response = requests.get(url)
    html = lxml.html.fromstring(response.text)
    html.make_links_absolute(response.url)
    images = html.cssselect('#photoMain > img')
    if not images:
        # Without this check, img_url below would be undefined
        raise ValueError('no main image found at ' + url)
    img_url = images[0].get('src')
    print(img_url)
    return img_url
The flow is almost the same as for getting the text. This function returns the URL of the item's main image.
The with open(...) block in the main code saves the image into the folder. Because the path is relative, the folder must be in the directory the script is run from (in my case, the same directory as the script). Please note that if you do not create an empty folder in advance, you will get an error.
By the way, I created a folder called 'picture_zozo'.
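As an alternative to creating the folder by hand, the script could create it on demand. The following is a minimal sketch, not the code I actually ran: `image_path` is a hypothetical helper name, and the example writes dummy bytes into a temporary directory instead of downloading anything.

```python
import os
import tempfile


def image_path(folder: str, number: int) -> str:
    """Create the folder if it is missing and return the path for image `number`."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, str(number) + ".jpg")


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "picture_zozo")
    path = image_path(folder, 1)
    with open(path, "wb") as f:
        f.write(b"dummy image bytes")  # stand-in for w_img.content
    saved = os.listdir(folder)

print(saved)
```

Using os.path.join instead of string concatenation also keeps the path handling independent of the operating system.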
Finally, the item number, the item URL, and the item text are saved in a CSV file. You can use it to check that the saved image and the image on the page you reach by clicking the URL show the same product.
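Product descriptions often contain commas and line breaks, so it is worth confirming that a row survives a round trip through the csv module. The sketch below uses an invented row and a temporary file. Opening the file with newline='' follows the csv module's documentation, and the UTF-8 encoding is an assumption you may need to change for your environment.

```python
import csv
import os
import tempfile

# Header plus one invented row whose text contains a comma and a line break.
rows = [
    ["no", "URL", "item_text"],
    [1, "https://example.com/item/1", "テキスト, with a comma\nand a line break"],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "item_text.csv")
    # Write the rows; the csv module quotes fields that need it.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    # Read them back to check nothing was split or lost.
    with open(csv_path, newline="", encoding="utf-8") as f:
        loaded = list(csv.reader(f))

print(len(loaded))
print(loaded[1][2] == rows[1][2])
```

Note that everything comes back as a string, so the item number 1 is read as "1".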
By the way, the image is saved like this.
By the way, the CSV file looks like this.
No | URL | text |
---|---|---|
1 | https://〜 | 〇〇 |
2 | https://〜 | △△ |
3 | https://〜 | □□ |
The value in the No column is the same as the number in the image filename (○.jpg).
Because the same URL is used for both the image and the text inside the inner for loop of the main code, each image and text pair belongs to the same product.
This was my first attempt at posting on Qiita, and I found it quite hard to get things across in writing. I have not mastered the material well enough to explain it perfectly, so the explanation may be hard to follow, and I ask for your understanding. If it is hard to picture what is going on, I recommend running the code yourself and checking the intermediate values with print().