[PYTHON] Image collection by web scraping

This post walks through code that collects the images found at a given URL by web scraping.

Implementation code

import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup
import time
from PIL import Image
import urllib.request
import os
import sys

class web_scryping:
    def __init__(self , url):
        self.url = url
        self.soup = BeautifulSoup(requests.get(self.url).content, 'lxml')
        
class download_images(web_scryping):
    def download(self, max_down_num):
        # Save up to max_down_num images as ./img/1.jpg, ./img/2.jpg, ...
        # The ./img directory must already exist.
        self.down_num = 0
        self.max_down_num = max_down_num
        self.save_path = './img/' + str(self.down_num + 1) + '.jpg'
        now_num = 0
        for link in self.soup.find_all("img"):
            src_attr = link.get("src")
            if not src_attr:
                continue  # skip img tags that have no src attribute
            # Resolve relative paths against the page URL
            target = urljoin(self.url, src_attr)
            resp = requests.get(target)
            image = resp.content  # raw bytes of the response body
            print(str(resp) + '  ' + str(now_num))
            now_num = now_num + 1
            if resp.status_code != 404:
                with open(self.save_path, 'wb') as f:
                    f.write(image)
                self.down_num = self.down_num + 1
            time.sleep(1)  # wait between requests to limit load on the site
            self.save_path = './img/' + str(self.down_num + 1) + '.jpg'
            if self.down_num == self.max_down_num:
                break
                
    def img_resize(self, img_path):
        # Resize the image at img_path to 800x1200 pixels
        try:
            im = Image.open(img_path)
            print("Original image size width: {}, height: {}".format(im.size[0], im.size[1]))
            im_resize = im.resize(size=(800, 1200))
            # Overwrite the source file with the resized image
            im_resize.save(img_path)
            print('image resize success')
        except Exception as e:
            print('image resize failed: {}'.format(e))


def main():
    url = sys.argv[1]  # sys.argv[0] is the script name; the URL is the first argument
    di = download_images(url)
    di.download(50)

if __name__ == '__main__':
    main()

About the flow of the program

Step 1

def main():
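    url = sys.argv[1]  # sys.argv[0] is the script name; the URL is the first argument
    di = download_images(url)
    di.download(50)

Pass the URL as the first command-line argument. sys.argv[0] holds the script name, so the URL is read from sys.argv[1]. The URL is then passed to the download_images class, which inherits from the web_scryping class, and download(50) limits the run to 50 saved images.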
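The argument handling above can be made a little more defensive. The following is only a sketch, not part of the original program: the helper name read_url and the script name scraper.py are hypothetical, and the check simply assumes that the scraper should accept http and https URLs only.

```python
from urllib.parse import urlparse


def read_url(argv):
    """Return the URL passed as the first command-line argument.

    argv[0] is the script name, so the first real argument is argv[1].
    read_url is a hypothetical helper, not part of the original code.
    """
    if len(argv) < 2:
        raise ValueError("usage: python scraper.py <url>")
    url = argv[1]
    parts = urlparse(url)
    # Accept only absolute http(s) URLs (an assumption of this sketch)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("not an http(s) URL: " + url)
    return url


# Simulated argument list; in the real script this would be sys.argv
print(read_url(["scraper.py", "https://example.com/gallery"]))
```

In main(), the call would look like url = read_url(sys.argv), so a missing or malformed argument fails early with a readable message instead of an IndexError or a failed request.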

Step 2

class web_scryping:
    def __init__(self , url):
        self.url = url
        self.soup = BeautifulSoup(requests.get(self.url).content, 'lxml')

The download_images class inherits from the web_scryping class and defines no __init__ method of its own, so creating a download_images object runs the __init__ method of web_scryping. That method fetches the page with requests.get and parses the HTML with BeautifulSoup. The parsed document is stored in the instance attribute self.soup.
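The inheritance behavior described here can be seen in isolation with a small sketch. The classes Page, ImagePage, and LimitedImagePage are hypothetical stand-ins for web_scryping and download_images, and a plain string takes the place of the BeautifulSoup object so that the example runs without any network access.

```python
class Page:
    """Stand-in for web_scryping: the constructor stores the URL."""

    def __init__(self, url):
        self.url = url               # instance attribute, one per object
        self.soup = "parsed:" + url  # placeholder for the parsed document


class ImagePage(Page):
    """Defines no __init__, so Page.__init__ runs for ImagePage(url)."""

    def describe(self):
        return self.soup


class LimitedImagePage(Page):
    """Defines its own __init__, so it calls the parent's explicitly."""

    def __init__(self, url, max_down_num):
        super().__init__(url)
        self.max_down_num = max_down_num


a = ImagePage("https://example.com/a")
b = LimitedImagePage("https://example.com/b", 50)
print(a.describe())           # parsed:https://example.com/a
print(b.url, b.max_down_num)  # https://example.com/b 50
```

If download_images ever gains its own __init__, for example to take the download limit up front, it would need the super().__init__(url) call shown in LimitedImagePage, otherwise self.soup would never be set.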

Step 3

class download_images(web_scryping):
    def download(self, max_down_num):
        # Save up to max_down_num images as ./img/1.jpg, ./img/2.jpg, ...
        # The ./img directory must already exist.
        self.down_num = 0
        self.max_down_num = max_down_num
        self.save_path = './img/' + str(self.down_num + 1) + '.jpg'
        now_num = 0
        for link in self.soup.find_all("img"):
            src_attr = link.get("src")
            if not src_attr:
                continue  # skip img tags that have no src attribute
            # Resolve relative paths against the page URL
            target = urljoin(self.url, src_attr)
            resp = requests.get(target)
            image = resp.content  # raw bytes of the response body
            print(str(resp) + '  ' + str(now_num))
            now_num = now_num + 1
            if resp.status_code != 404:
                with open(self.save_path, 'wb') as f:
                    f.write(image)
                self.down_num = self.down_num + 1
            time.sleep(1)  # wait between requests to limit load on the site
            self.save_path = './img/' + str(self.down_num + 1) + '.jpg'
            if self.down_num == self.max_down_num:
                break

Step 1 calls the download method of the download_images class to start the download. self.save_path holds the output file name: images are saved in the img directory as sequentially numbered .jpg files (1.jpg, 2.jpg, and so on).

self.soup.find_all("img"): Finds every img tag in the HTML.

src_attr = link.get("src"): Reads the src attribute of the img tag.

target = urljoin(self.url, src_attr): Combines the page URL with the src value, so that relative paths become absolute URLs that can be requested.

image = resp.content: The image variable holds the raw bytes of the response body, that is, the image data.

The 404 check: The resp variable holds the response, so the image is saved only when the status is not 404.

time.sleep(1): Scraping should not put load on the website, so the loop waits one second between requests.
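How urljoin treats different kinds of src values is easy to check on its own. The sketch below uses urllib.parse.urljoin from the standard library, which is the function that requests.compat.urljoin refers to on Python 3. The helper name resolve_image_urls and the example URLs are made up for illustration, and dropping everything that is not http or https (such as inline data: URIs) is an assumption about what the scraper should download.

```python
from urllib.parse import urljoin


def resolve_image_urls(page_url, src_values):
    """Turn raw src attribute values into absolute http(s) URLs.

    resolve_image_urls is a hypothetical helper for illustration.
    """
    resolved = []
    for src in src_values:
        if not src:
            continue  # img tag without a src attribute
        target = urljoin(page_url, src)
        # Keep only URLs that can be fetched over HTTP; this drops data: URIs
        if target.startswith(("http://", "https://")):
            resolved.append(target)
    return resolved


page = "https://example.com/blog/post.html"
srcs = [
    "photo.jpg",                        # relative to the page
    "/static/logo.png",                 # relative to the site root
    "//cdn.example.com/a.gif",          # protocol-relative
    "https://other.example.org/b.jpg",  # already absolute
    None,                               # missing src
    "data:image/png;base64,AAAA",       # inline image, nothing to download
]
for url in resolve_image_urls(page, srcs):
    print(url)
```

The four printed URLs are https://example.com/blog/photo.jpg, https://example.com/static/logo.png, https://cdn.example.com/a.gif, and https://other.example.org/b.jpg. The missing src and the data: URI are skipped before any request is made.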

Bonus

    def img_resize(self, img_path):
        # Resize the image at img_path to 800x1200 pixels
        try:
            im = Image.open(img_path)
            print("Original image size width: {}, height: {}".format(im.size[0], im.size[1]))
            im_resize = im.resize(size=(800, 1200))
            # Overwrite the source file with the resized image
            im_resize.save(img_path)
            print('image resize success')
        except Exception as e:
            print('image resize failed: {}'.format(e))

This method resizes an image to a fixed 800x1200 pixels. I did not use it in the end, because enlarging small images made them look noticeably worse.
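One way around the quality problem is to shrink large images only and leave small ones untouched, while keeping the aspect ratio. The sketch below computes such a target size with plain arithmetic. The function name fit_within is hypothetical and this is not what the original method does. The resulting tuple could be passed to im.resize in place of the fixed (800, 1200). Pillow's Image.thumbnail method applies a similar shrink-only, ratio-preserving rule in place.

```python
def fit_within(size, max_size):
    """Return (width, height) scaled to fit inside max_size.

    The aspect ratio is kept and the image is never enlarged.
    fit_within is a hypothetical helper for illustration.
    """
    width, height = size
    max_w, max_h = max_size
    # A scale of 1.0 is the upper limit, so small images keep their size
    scale = min(max_w / width, max_h / height, 1.0)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


print(fit_within((1600, 1200), (800, 1200)))  # (800, 600)
print(fit_within((400, 300), (800, 1200)))    # (400, 300)
```

The first image is halved to fit the 800 pixel width limit, and the second is already small enough, so its size is returned unchanged.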
