Get images from the wonderful FIND/47 site using Python (1/2: up to creating the target list)

Postscript (added after publication)

If you just want the finished deliverables, grab them from the second article first; the story below is then optional reading.

Introduction

Do you know the wonderful site FIND/47? It offers beautiful, high-quality landscape images from all 47 prefectures of Japan. According to my survey, as of the end of October 2020 there are 1,080 images in total nationwide, downloadable in S, M, L, and XL resolutions (depending on the image). And yet, a quick look at the public access counts suggests the site is not widely known or used, which is a shame. So in this series of articles (two or more), I will use Python to (1) build a list of the images, (2) download them, and (3) set up an auto-changing wallpaper environment on Ubuntu/LXDE with [Variety](https://peterlevi.com/variety/). This is introductory-level material along the lines of "here is something I wanted to achieve with scraping, implemented in Python and verified"; I have not applied any special techniques or optimizations, so please bear that in mind. Ultimately, I will be happy to realize a desktop like the one in the image below (purely as a personal hobby). That's all. (screenshot: desktop with a FIND/47 wallpaper)

Points to note

Nothing in this first article requires it yet, but you will need roughly 10 GB of free space to download the images. In the second (or third) article, I will touch on how to reduce the resolution and thereby shrink the download size.

Operating environment

Please prepare something comparable as appropriate.

Ubuntu

$ cat /etc/issue
Ubuntu 20.04.1 LTS \n \l

Python

$ python3 --version
Python 3.8.5

Installed via pip3. There may be other requirements; check the imports at the top of the code and install what is missing (e.g. `pip3 install beautifulsoup4 requests tqdm`).

$ pip3 list
beautifulsoup4               4.8.2
requests                     2.22.0
tqdm                         4.50.2

Operation outline and code

Operation overview

Saving the images takes six steps in total; this Part 1/2 covers steps 1-3. Concretely, it writes the list of images to download to a text file in a reasonably readable form. First, while paging through the site, it collects the image links exhaustively in a not-yet-readable form (stage 01, in-memory); then it adds size information to each entry in the list (stage 02, in-memory); and finally it writes the list out in CSV format (stage 03, file output). At that point we know, for each image, its region (0-7), prefecture (0-46), file name (note: no extension), and the largest available size (xl, l, m, s), as well as the total number of images we can get.
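Each row of the resulting CSV thus holds the area index, the prefecture index, the five-character image ID, and the largest available size. For illustration only (these image IDs are invented), two Kyoto entries (area 4, prefecture 25 in the code's indexing) might look like:

4,25,a1b2c,xl
4,25,d3e4f,l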

Code (1): 01_generate_urls.py

Create a suitable folder (e.g. /home/nekoneko/codes/python/find47) and save the following directly under it as 01_generate_urls.py.

#!/usr/bin/env python3
# coding: utf-8

import csv
import os
import re
import requests
import time
from bs4 import BeautifulSoup

# e.g. https://search.find47.jp/ja/images?area=kinki&prefectures=kyoto&page=3

# declare variables

base_url            =   'https://search.find47.jp/ja/images?'

valid_urls          = []
target_urls         = []

areas               = [ 'hokkaido', 'tohoku', 'kanto-koshinetsu', 'tokai-hokuriku', 'kinki',
                        'chugoku' , 'sikoku', 'kyushu-okinawa' ]
prefs_head_by_area  = [  0,  1,  7, 17, 24, 30, 35, 39 ]
prefs_count_by_area = [  1,  6, 10,  7,  6,  5,  4,  8 ]
prefectures         = [
                        'hokkaido' ,

                        'aomori'   , 'iwate'     , 'miyagi'    , 'akita'    , 'yamagata' ,
                        'fukushima',

                        'tokyo'    , 'kanagawa'  , 'saitama'   , 'chiba'    , 'ibaraki'  ,
                        'tochigi'  , 'gunma'     , 'yamanashi' , 'niigata'  , 'nagano'   ,

                        'toyama'   , 'ishikawa'  , 'fukui'     , 'gifu'     , 'shizuoka' ,
                        'aichi'    , 'mie'       ,

                        'shiga'    , 'kyoto'     , 'osaka'     , 'hyogo'    , 'nara'     ,
                        'wakayama' ,

                        'tottori'  , 'shimane'   , 'okayama'   , 'hiroshima', 'yamaguchi',

                        'tokushima', 'kagawa'    , 'ehime'     , 'kochi'    ,

                        'fukuoka'  , 'saga'      , 'nagasaki'  , 'kumamoto' , 'oita'     ,
                        'miyazaki' , 'kagoshima' , 'okinawa'
                      ]

image_sizes         = ['xl' , 'l' , 'm' , 's']
max_pages           = 21
waiting_seconds     = 6

# make output folder

os.makedirs('./txt', exist_ok=True)

# functions

def generate_target_urls():
    # stage 01: page through every area/prefecture and collect links
    # to the image detail pages ('/ja/i/<id>').
    for i in range(len(prefs_head_by_area)):

        for j in range(prefs_head_by_area[i],
                       prefs_head_by_area[i] + prefs_count_by_area[i]):

            for k in range(1, max_pages):
                target_url = base_url \
                    + 'area=' + areas[i] \
                    + '&prefectures=' + prefectures[j] \
                    + '&page=' + str(k)

                time.sleep(waiting_seconds)
                html          = requests.get(target_url)
                html.encoding = 'utf-8'
                soup          = BeautifulSoup(html.text, 'html.parser')

                # consider only anchors that actually carry an href attribute
                for a_tag in soup.find_all('a', href=True):
                    href = a_tag['href']
                    if re.match('^/ja/i/', href):
                        # size is still unknown here; 'z' is a placeholder
                        target_urls.append([i, j, href, 'z'])
    return

def update_details_in_target_urls():
    # stage 02: for each image, try sizes from largest to smallest and
    # record the first one the server reports as available (HTTP 200).
    base_image_url = 'https://search.find47.jp/ja/images/'
    for entry in target_urls:
        image_id = str(entry[2][-5:])    # last 5 chars of '/ja/i/<id>'

        for size in image_sizes:
            time.sleep(waiting_seconds)
            image_url  = base_image_url + image_id + '/download/' + size
            # stream=True lets us check the status code without
            # downloading the image body itself
            image_link = requests.get(image_url, stream=True)
            image_link.close()

            if image_link.status_code == 200:
                entry[2] = image_id
                entry[3] = size
                break
    return

def write_out_to_csv_file():
    # stage 03: dump the list to ./txt/01.csv
    with open('./txt/01.csv', mode='w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for entry in target_urls:
            writer.writerow(entry)
    return

# main routine
## generate target urls list as a text file with info in a simple format.

### stage 01
print('stage 01/03 started.')
generate_target_urls()
print('stage 01/03 completed.')

### stage 02
print('stage 02/03 started.')
update_details_in_target_urls()
print('stage 02/03 completed.')

### stage 03
print('stage 03/03 started.')
write_out_to_csv_file()
print('stage 03/03 completed.')
print('All operations of 01_generate_urls.py completed.')

# end of this script

Code (2): 47_finder.sh

Save the following as 47_finder.sh directly under the same folder created above (e.g. /home/nekoneko/codes/python/find47), and make it executable with chmod +x.

#!/bin/bash
cd /home/nekoneko/codes/python/find47

python3 ./01_generate_urls.py > ./txt/01.log 2>&1
#python3 ./02_download_jpgs.py > ./txt/02.log 2>&1

Run

It is recommended to run it from cron. The log file is written to ./txt/01.log; from it you can see that there are 1,080 images across the 8 areas (from Hokkaido down to Kyushu-Okinawa). (screenshot: log output)
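For example, a crontab entry like the following (a hypothetical schedule; adjust the path to your own environment) would start the script at 02:00 every night:

0 2 * * * /home/nekoneko/codes/python/find47/47_finder.sh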

Execution example

Although the layout differs slightly (the screenshot was taken during development), the file is created as ./txt/01.csv in a format like the following. (screenshot: ./txt/01.csv)
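Once ./txt/01.csv exists, you can sanity-check it with a few lines of Python. The sketch below is not one of the article's scripts; it assumes the four-column format described above and tallies the rows per area:

#!/usr/bin/env python3
# coding: utf-8

# Minimal sanity-check sketch (not part of the article's deliverables):
# tally ./txt/01.csv by its first column (area index 0-7).

import csv
from collections import Counter

with open('./txt/01.csv', encoding='utf-8') as f:
    rows = [row for row in csv.reader(f) if row]

per_area = Counter(row[0] for row in rows)

print('total images:', len(rows))
for area_index in sorted(per_area, key=int):
    print('area', area_index, ':', per_area[area_index], 'images')

The printed total should match the 1,080 images mentioned above.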

Estimated running time

Creating this list takes about 10 hours: with a 6-second wait per request, the roughly 47 prefectures × 20 list pages of stage 01 plus up to four size checks for each of the 1,080 images in stage 02 add up to that order. The image download in the next article will also take about 10 hours.

Summary

This article introduced a procedure for fetching the beautiful landscape images of all 47 prefectures from the wonderful FIND/47 site using Python. This time, I covered the steps up to writing the target URLs out to a text file, with code. In the next article, I will download the images based on the list obtained here.
