[PYTHON] A program that automatically saves Nogizaka46 blog images

Introduction

Hello. This is @tfujitani, a Sakamichi ("slope") otaku. This time, I wrote a Python program that fully automatically saves the blog images of a specified Nogizaka46 member, so I am publishing it here. By the way, the "trigger" (Kikkake; it's a good song, isn't it?) for making this program was the graduation of Sayuri Inoue, my favorite member. Here is the program I wrote for that purpose.

What I did this time

What was used

・ Python
・ Beautiful Soup

Installation of scraping environment

pip install requests
pip install beautifulsoup4

Python code

I'm scraping with Beautiful Soup and Python 3. Set member to the part of the blog URL that identifies the member whose images you want to save. For Manatsu Akimoto (http://blog.nogizaka46.com/manatsu.akimoto/) it is "manatsu.akimoto", and for Riria Ito (http://blog.nogizaka46.com/riria.itou/) it is "riria.itou". You can also specify the start and end of the period you want to save.
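As a rough illustration of how the member name and the period turn into the pages to scrape, here is a minimal sketch. The helper name monthly_urls is hypothetical (it does not appear in nogiblog.py); the ?d=YYYYMM query format is the one the script builds for each month.

```python
DOMAIN = "http://blog.nogizaka46.com/"

def monthly_urls(member, start, end):
    """Yield one monthly archive URL per month from start to end, inclusive.

    start and end are (year, month) tuples. monthly_urls is a hypothetical
    helper used only for illustration.
    """
    year, month = start
    while (year, month) <= end:
        # The month is zero-padded to two digits: 2013, 1 -> "201301"
        yield "{}{}/?d={}{:02d}".format(DOMAIN, member, year, month)
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1

if __name__ == "__main__":
    for page in monthly_urls("manatsu.akimoto", (2012, 12), (2013, 2)):
        print(page)
```

Comparing (year, month) tuples also means the loop simply yields nothing if the end point is earlier than the start point.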

nogiblog.py


# coding:utf-8
from time import sleep
from bs4 import BeautifulSoup
import sys
import requests, urllib.request, os

domain="http://blog.nogizaka46.com/"
member="manatsu.akimoto" #Member designation
url=domain+member+"/"

def getImages(soup,cnt,year,month):
    #Function to save images. Returns the updated counter so numbering continues across pages
    member_path="./"+member
    ym=str(year)+str(month).zfill(2)#Zero-padded year and month, e.g. 201301
    for entry in soup.find_all("div", class_="entrybody"):#Get all entry bodies
        for img in entry.find_all("img"):#Get all img
            imgurl=img.get("src")
            if not imgurl:#Skip img tags without src
                continue
            cnt +=1
            imgurlnon=imgurl.replace('https','http')
            try:
                urllib.request.urlretrieve(imgurlnon, member_path + ym + "-" + str(cnt) + ".jpeg")
            except Exception as e:
                print("error",imgurlnon,e)
    return cnt


if(__name__ == "__main__"):
    #The beginning of the blog to save
    year=2012
    month=12
    #End of blog to save
    endyear=2020
    endmonth=6

    headers = {"User-Agent": "Mozilla/5.0"}
    while(True):
        BlogPageURL=url+"?d="+str(year)+str(month).zfill(2)#e.g. ?d=201301
        soup = BeautifulSoup(requests.get(BlogPageURL, headers=headers).content, 'html.parser')#Get html
        print(year,month)
        sleep(3)
        cnt = 0
        ht=soup.find_all("div", class_="paginate")
        print("ht",ht)
        cnt = getImages(soup,cnt,year,month)#Calling the image storage function
        if len(ht)>0:#If there are multiple pages in the same month, save only that page
            ht_url=ht[0]
            print(ht_url)
            url_all=ht_url.find_all("a")
            for hturl in url_all[:-1]:#The last link (">") points to page 2 again, so skip it
                link = hturl.get("href")
                print("url",url+link)
                soup = BeautifulSoup(requests.get(url+link, headers=headers).content, 'html.parser')
                sleep(3)
                cnt = getImages(soup,cnt,year,month)#Continue numbering so earlier images are not overwritten
        if year==endyear and month==endmonth:
            print("Finish")
            sys.exit()#The end of the program
        if month==12:
            month=1
            year=year+1
        else:
            month=month+1
        print("update",year,month)

By the way, the comment "#If there are multiple pages in the same month, save only that page" refers to a situation like the one in this screenshot. (Screenshot 2020-06-26 15.21.20.png) The example is Manatsu Akimoto's blog for January 2013: after saving the images on the first page, the program gets the links to pages 2, 3 and 4 and saves the images on each of those pages as well.

Execution result

When I tried it on Manatsu Akimoto's blog, I was able to confirm that the images were saved in the following form.

By the way, I thought ht was hard to understand in the program above, so here is what that part prints when the script runs. It is a little hard to read, but this is how the pages within each month are found and scraped.

ht 
[<div class="paginate"> 1  | <a href="?p=2&amp;d=201301"> 2 </a> | <a href="?p=3&amp;d=201301"> 3 </a> | <a href="?p=4&amp;d=201301"> 4 </a> | <a href="?p=2&amp;d=201301">></a></div>, <div class="paginate"> 1  | <a href="?p=2&amp;d=201301"> 2 </a> | <a href="?p=3&amp;d=201301"> 3 </a> | <a href="?p=4&amp;d=201301"> 4 </a> | <a href="?p=2&amp;d=201301">></a></div>]

After the first page has been scraped, you can see below that pages 2 through 4 are scraped in turn.

url http://blog.nogizaka46.com/manatsu.akimoto/?p=2&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=3&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=4&d=201301
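The step from the ht output to the three URLs above can be sketched as follows. This is only an illustration under some assumptions: it swaps in the standard library's HTMLParser instead of Beautiful Soup so that it runs without extra packages, the names LinkCollector and page_urls are hypothetical, and SAMPLE is the first paginate div from the output above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def page_urls(base_url, paginate_html):
    """Return absolute URLs for pages 2 and later of one month (hypothetical helper)."""
    parser = LinkCollector()
    parser.feed(paginate_html)
    # The last link is the ">" (next) arrow, which repeats page 2, so drop it.
    return [urljoin(base_url, href) for href in parser.hrefs[:-1]]

# First paginate div from the ht output shown above.
SAMPLE = ('<div class="paginate"> 1  | <a href="?p=2&amp;d=201301"> 2 </a> | '
          '<a href="?p=3&amp;d=201301"> 3 </a> | <a href="?p=4&amp;d=201301"> 4 </a> | '
          '<a href="?p=2&amp;d=201301">&gt;</a></div>')

if __name__ == "__main__":
    for page in page_urls("http://blog.nogizaka46.com/manatsu.akimoto/", SAMPLE):
        print(page)
```

Page 1 has no link of its own in the paginate div because it is the page already fetched, which is why only pages 2 to 4 show up in the list.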

In conclusion

As an aside, when I saved Sayuri Inoue's blog, the number of images easily ran into the thousands (2385 after deleting the unnecessary ones). (Screenshot 2020-06-26 15.40.16.png) It shows what a hard worker Sayu is.

References

The article at https://qiita.com/xxPowderxx/items/e9726b8b8a114655d796 was insanely helpful.
