If all you want is the deliverables, you can grab them from the 2nd article first and then come back to the story below.
Do you know the wonderful site find/47? It provides beautiful, high-quality landscape photos of all 47 prefectures in Japan. According to my survey, as of the end of October 2020 there are 1,080 images in total nationwide, and they can be downloaded in S, M, L, and XL resolutions (depending on the image). Judging from the public access counts, however, the site does not seem to be widely known or used, which is a shame. So in this article (and one or two more), I will use Python to (1) build a list of the images, (2) download them, and (3) set up an auto-changing wallpaper environment on Ubuntu / LXDE with [variety](https://peterlevi.com/variety/). The level is introductory: "there is something you want to achieve with scraping, you implement it in Python, and you check that it works". I have not done anything clever in terms of technique or optimization, so please keep that in mind. Ultimately, I will be happy if I end up with a desktop like the one in the image below (as a personal hobby). That's all.
This first article does not require it yet, but you will need about 10 GB of free space to download the images. In the second or third article I will touch on how to lower the resolution and reduce the download size.
Adapt the following to your own environment as appropriate.
Ubuntu
$ cat /etc/issue
Ubuntu 20.04.1 LTS \n \l
Python
$ python3 --version
Python 3.8.5
Packages installed with pip3 (there may be others; check the imports at the beginning of the code and install whatever is missing).
$ pip3 list
beautifulsoup4 4.8.2
requests 2.22.0
tqdm 4.50.2
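If any of these are missing, they can be installed with pip3, for example:
$ pip3 install beautifulsoup4 requests tqdm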
There are six steps in total to save the images; this article (part 1 of 2) covers steps 1-3. Concretely, it outputs the list of images to be downloaded to a text file in a reasonably readable form. First, it crawls the result pages to build an exhaustive, not-yet-readable list (stage 01, in memory), then adds size information to each entry (stage 02, in memory), and finally writes the list out in CSV format (stage 03, file output). At that point we know, for every image: its region (0-7), its prefecture (0-46), its file name (note: no extension), the largest size available (xl, l, m, s), and how many images can be obtained in total.
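As a rough sketch of the data that flows through the three stages (the values below are made-up examples; only the shape matches the code that follows), one entry of the in-memory list changes like this:
# one entry of target_urls, with hypothetical example values
# stage 01: [area index, prefecture index, link found on the page, size placeholder]
entry = [4, 25, '/ja/i/ab12c', 'z']
# stage 02: the link is trimmed to the 5-character image id and the placeholder
#           is replaced with the largest size that returned HTTP 200
entry = [4, 25, 'ab12c', 'xl']
# stage 03: the same entry becomes one row of ./txt/01.csv -> "4,25,ab12c,xl"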
Create an appropriate folder (e.g. /home/nekoneko/codes/python/find47) and save the following script directly under it with the file name 01_generate_urls.py.
#!/usr/bin/env python3
# coding: utf-8
import csv
import re
import requests
import subprocess
import time
from bs4 import BeautifulSoup
# e.g. https://search.find47.jp/ja/images?area=kinki&prefectures=kyoto&page=3
# declare variables
base_url = 'https://search.find47.jp/ja/images?'
valid_urls = []
target_urls = []
areas = [ 'hokkaido', 'tohoku', 'kanto-koshinetsu', 'tokai-hokuriku', 'kinki',
'chugoku' , 'sikoku', 'kyushu-okinawa' ]
prefs_head_by_area = [ 0, 1, 7, 17, 24, 30, 35, 39 ]
prefs_count_by_area = [ 1, 6, 10, 7, 6, 5, 4, 8 ]
prefectures = [
'hokkaido' ,
'aomori' , 'iwate' , 'miyagi' , 'akita' , 'yamagata' ,
'fukushima',
'tokyo' , 'kanagawa' , 'saitama' , 'chiba' , 'ibaraki' ,
'tochigi' , 'gunma' , 'yamanashi' , 'niigata' , 'nagano' ,
'toyama' , 'ishikawa' , 'fukui' , 'gifu' , 'shizuoka' ,
'aichi' , 'mie' ,
'shiga' , 'kyoto' , 'osaka' , 'hyogo' , 'nara' ,
'wakayama' ,
'tottori' , 'shimane' , 'okayama' , 'hiroshima', 'yamaguchi',
'tokushima', 'kagawa' , 'ehime' , 'kochi' ,
'fukuoka' , 'saga' , 'nagasaki' , 'kumamoto' , 'oita' ,
'miyazaki' , 'kagoshima' , 'okinawa'
]
image_sizes = ['xl' , 'l' , 'm' , 's']
max_pages = 21
waiting_seconds = 6
# make output folder
command = ('mkdir', '-p', './txt')
res = subprocess.call(command)
# functions
def generate_target_urls():
    # stage 01: crawl every area / prefecture / result page and collect the
    # links to the individual image pages (/ja/i/xxxxx).
    for i in range(0, len(prefs_head_by_area)):
        for j in range(prefs_head_by_area[i],
                       prefs_head_by_area[i] + prefs_count_by_area[i]):
            for k in range(1, max_pages):
                target_url = base_url \
                    + 'area=' + areas[i] \
                    + '&prefectures=' \
                    + prefectures[j] \
                    + '&page=' \
                    + str(k)
                time.sleep(waiting_seconds)
                html = requests.get(target_url)
                html.encoding = 'utf-8'
                soup = BeautifulSoup(html.text, 'html.parser')
                atags = soup.find_all('a')
                for atag in atags:
                    href = atag['href']
                    if re.match('^/ja/i/', href):
                        # 'z' is a placeholder for the size (filled in at stage 02)
                        target_urls.append([i, j, href, 'z'])
    return
def update_details_in_target_urls():
    # stage 02: for each image, probe the sizes from largest to smallest and
    # keep the first one the server actually serves (HTTP 200).
    base_image_url = 'https://search.find47.jp/ja/images/'
    for i in target_urls:
        for j in image_sizes:
            time.sleep(waiting_seconds)
            # the last 5 characters of the stored link are the image id
            image_url = base_image_url + str(i[2][-5:]) + '/download/' + j
            image_link = requests.get(image_url)
            if image_link.status_code == 200:
                i[2] = str(i[2][-5:])   # replace the link with the bare image id
                i[3] = j                # record the largest available size
                break
    return
def write_out_to_csv_file():
    # stage 03: write the list to ./txt/01.csv, one image per row
    with open('./txt/01.csv', mode='w', encoding='utf-8') as f:
        writer = csv.writer(f)
        for i in target_urls:
            writer.writerow(i)
    return
# main routine
## generate target urls list as a text file with info in a simple format.
### stage 01
print('stage 01/03 started.')
generate_target_urls()
print('stage 01 completed.')
### stage 02
print('stage 02/03 started.')
update_details_in_target_urls()
print('stage 02 completed.')
### stage 03
print('stage 03/03 started.')
write_out_to_csv_file()
print('stage 03/03 completed.')
print('All operations of 01_generate_urls.py completed.')
# end of this script
Save the following as 47_finder.sh directly under the folder you created (e.g. /home/nekoneko/codes/python/find47), and make it executable with chmod +x.
#!/bin/bash
cd /home/nekoneko/codes/python/find47
python3 ./01_generate_urls.py > ./txt/01.log 2>&1
#python3 ./02_download_jpgs.py > ./txt/02.log 2>&1
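To make it executable and try it once by hand (using the example path above):
$ cd /home/nekoneko/codes/python/find47
$ chmod +x ./47_finder.sh
$ ./47_finder.sh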
I recommend running it from cron. The log is written to ./txt/01.log; from it you can confirm that there are 1,080 images across the 8 areas (from Hokkaido to Kyushu-Okinawa).
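For reference, a crontab entry along the following lines would start it at 02:00 every day (adjust the schedule and path to your environment):
0 2 * * * /home/nekoneko/codes/python/find47/47_finder.sh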
The file is created as ./txt/01.csv in roughly the following format (the original screenshot was taken during development, so details may differ slightly).
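Since the screenshot is not reproduced here, the rows look roughly like this (the image ids are made-up examples; the columns are area index, prefecture index, image id, and size, and the real file has 1,080 rows):
0,0,ab12c,xl
0,0,cd34e,l
1,1,ef56g,xl
...
7,46,hi78j,m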
Creating this list takes about 10 hours. Downloading the images in the next article will also take roughly 10 hours.
This article introduced the procedure for obtaining beautiful landscape images of all 47 prefectures from the wonderful find/47 site using Python. This time, I explained, with code, the steps up to writing the list of target URLs out to a text file. In the next article, I will download the images based on the list obtained here.