Get images from the wonderful FIND/47 site using Python (1/2: up to creating the target list)

Postscript (added after publication)

If you just want the finished deliverables, grab them from the second article first; the story below is then optional reading.

Introduction

Do you know the wonderful site FIND/47? It offers beautiful, high-quality landscape images from all 47 prefectures of Japan. According to my survey, as of the end of October 2020 there are 1,080 images in total nationwide, downloadable in S, M, L, and XL resolutions (depending on the image). And yet, a quick look at the public access counts suggests the site is not widely known or used, which is a shame. So in this series of articles (two or more), I will use Python to (1) build a list of the images, (2) download them, and (3) set up an auto-changing wallpaper environment on Ubuntu/LXDE with [Variety](https://peterlevi.com/variety/). This is introductory-level material along the lines of "here is something I wanted to achieve with scraping, implemented in Python and verified"; I have not applied any special techniques or optimizations, so please bear that in mind. Ultimately, I will be happy to realize a desktop like the one in the image below (purely as a personal hobby). That's all. (screenshot: desktop with a FIND/47 wallpaper)

Points to note

Nothing in this first article requires it yet, but you will need roughly 10 GB of free space to download the images. In the second (or third) article, I will touch on how to reduce the resolution and thereby shrink the download size.

Operating environment

Please prepare something comparable as appropriate.

Ubuntu

$ cat /etc/issue
Ubuntu 20.04.1 LTS \n \l

Python

$ python3 --version
Python 3.8.5

Installed via pip3. There may be other requirements; check the imports at the top of the code and install what is missing (e.g. `pip3 install beautifulsoup4 requests tqdm`).

$ pip3 list
beautifulsoup4               4.8.2
requests                     2.22.0
tqdm                         4.50.2

Operation outline and code

Operation overview

Saving the images takes six steps in total; this Part 1/2 covers steps 1-3. Concretely, it writes the list of images to download to a text file in a reasonably readable form. First, while paging through the site, it collects the image links exhaustively in a not-yet-readable form (stage 01, in-memory); then it adds size information to each entry in the list (stage 02, in-memory); and finally it writes the list out in CSV format (stage 03, file output). At that point we know, for each image, its region (0-7), prefecture (0-46), file name (note: no extension), and the largest available size (xl, l, m, s), as well as the total number of images we can get.
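Each row of the resulting CSV thus holds the area index, the prefecture index, the five-character image ID, and the largest available size. For illustration only (these image IDs are invented), two Kyoto entries (area 4, prefecture 25 in the code's indexing) might look like:

4,25,a1b2c,xl
4,25,d3e4f,l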

Code (1): 01_generate_urls.py

Create a suitable folder (e.g. /home/nekoneko/codes/python/find47) and save the following directly under it as 01_generate_urls.py.

#!/usr/bin/env python3
# coding: utf-8

import csv
import os
import re
import requests
import time
from bs4 import BeautifulSoup

# e.g. https://search.find47.jp/ja/images?area=kinki&prefectures=kyoto&page=3

# declare variables

base_url            =   'https://search.find47.jp/ja/images?'

valid_urls          = []
target_urls         = []

areas               = [ 'hokkaido', 'tohoku', 'kanto-koshinetsu', 'tokai-hokuriku', 'kinki',
                        'chugoku' , 'sikoku', 'kyushu-okinawa' ]
prefs_head_by_area  = [  0,  1,  7, 17, 24, 30, 35, 39 ]
prefs_count_by_area = [  1,  6, 10,  7,  6,  5,  4,  8 ]
prefectures         = [
                        'hokkaido' ,

                        'aomori'   , 'iwate'     , 'miyagi'    , 'akita'    , 'yamagata' ,
                        'fukushima',

                        'tokyo'    , 'kanagawa'  , 'saitama'   , 'chiba'    , 'ibaraki'  ,
                        'tochigi'  , 'gunma'     , 'yamanashi' , 'niigata'  , 'nagano'   ,

                        'toyama'   , 'ishikawa'  , 'fukui'     , 'gifu'     , 'shizuoka' ,
                        'aichi'    , 'mie'       ,

                        'shiga'    , 'kyoto'     , 'osaka'     , 'hyogo'    , 'nara'     ,
                        'wakayama' ,

                        'tottori'  , 'shimane'   , 'okayama'   , 'hiroshima', 'yamaguchi',

                        'tokushima', 'kagawa'    , 'ehime'     , 'kochi'    ,

                        'fukuoka'  , 'saga'      , 'nagasaki'  , 'kumamoto' , 'oita'     ,
                        'miyazaki' , 'kagoshima' , 'okinawa'
                      ]

image_sizes         = ['xl' , 'l' , 'm' , 's']
max_pages           = 21
waiting_seconds     = 6

# make output folder

os.makedirs('./txt', exist_ok=True)

# functions

def generate_target_urls():
    # stage 01: page through every area/prefecture and collect links
    # to the image detail pages ('/ja/i/<id>').
    for i in range(len(prefs_head_by_area)):

        for j in range(prefs_head_by_area[i],
                       prefs_head_by_area[i] + prefs_count_by_area[i]):

            for k in range(1, max_pages):
                target_url = base_url \
                    + 'area=' + areas[i] \
                    + '&prefectures=' + prefectures[j] \
                    + '&page=' + str(k)

                time.sleep(waiting_seconds)
                html          = requests.get(target_url)
                html.encoding = 'utf-8'
                soup          = BeautifulSoup(html.text, 'html.parser')

                # consider only anchors that actually carry an href attribute
                for a_tag in soup.find_all('a', href=True):
                    href = a_tag['href']
                    if re.match('^/ja/i/', href):
                        # size is still unknown here; 'z' is a placeholder
                        target_urls.append([i, j, href, 'z'])
    return

def update_details_in_target_urls():
    # stage 02: for each image, try sizes from largest to smallest and
    # record the first one the server reports as available (HTTP 200).
    base_image_url = 'https://search.find47.jp/ja/images/'
    for entry in target_urls:
        image_id = str(entry[2][-5:])    # last 5 chars of '/ja/i/<id>'

        for size in image_sizes:
            time.sleep(waiting_seconds)
            image_url  = base_image_url + image_id + '/download/' + size
            # stream=True lets us check the status code without
            # downloading the image body itself
            image_link = requests.get(image_url, stream=True)
            image_link.close()

            if image_link.status_code == 200:
                entry[2] = image_id
                entry[3] = size
                break
    return

def write_out_to_csv_file():
    # stage 03: dump the list to ./txt/01.csv
    with open('./txt/01.csv', mode='w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for entry in target_urls:
            writer.writerow(entry)
    return

# main routine
## generate target urls list as a text file with info in a simple format.

### stage 01
print('stage 01/03 started.')
generate_target_urls()
print('stage 01/03 completed.')

### stage 02
print('stage 02/03 started.')
update_details_in_target_urls()
print('stage 02/03 completed.')

### stage 03
print('stage 03/03 started.')
write_out_to_csv_file()
print('stage 03/03 completed.')
print('All operations of 01_generate_urls.py completed.')

# end of this script

Code (2): 47_finder.sh

Save the following as 47_finder.sh directly under the same folder created above (e.g. /home/nekoneko/codes/python/find47), and make it executable with chmod +x.

#!/bin/bash
cd /home/nekoneko/codes/python/find47

python3 ./01_generate_urls.py > ./txt/01.log 2>&1
#python3 ./02_download_jpgs.py > ./txt/02.log 2>&1

Run

It is recommended to run it from cron. The log file is written to ./txt/01.log; from it you can see that there are 1,080 images across the 8 areas (from Hokkaido down to Kyushu-Okinawa). (screenshot: log output)
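For example, a crontab entry like the following (a hypothetical schedule; adjust the path to your own environment) would start the script at 02:00 every night:

0 2 * * * /home/nekoneko/codes/python/find47/47_finder.sh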

Execution example

Although the layout differs slightly (the screenshot was taken during development), the file is created as ./txt/01.csv in a format like the following. (screenshot: ./txt/01.csv)
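Once ./txt/01.csv exists, you can sanity-check it with a few lines of Python. The sketch below is not one of the article's scripts; it assumes the four-column format described above and tallies the rows per area:

#!/usr/bin/env python3
# coding: utf-8

# Minimal sanity-check sketch (not part of the article's deliverables):
# tally ./txt/01.csv by its first column (area index 0-7).

import csv
from collections import Counter

with open('./txt/01.csv', encoding='utf-8') as f:
    rows = [row for row in csv.reader(f) if row]

per_area = Counter(row[0] for row in rows)

print('total images:', len(rows))
for area_index in sorted(per_area, key=int):
    print('area', area_index, ':', per_area[area_index], 'images')

The printed total should match the 1,080 images mentioned above.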

Estimated running time

Creating this list takes about 10 hours: with a 6-second wait per request, the roughly 47 prefectures × 20 list pages of stage 01 plus up to four size checks for each of the 1,080 images in stage 02 add up to that order. The image download in the next article will also take about 10 hours.

Summary

This article introduced a procedure for fetching the beautiful landscape images of all 47 prefectures from the wonderful FIND/47 site using Python. This time, I covered the steps up to writing the target URLs out to a text file, with code. In the next article, I will download the images based on the list obtained here.
