When I made the update-notification app for Shōsetsuka ni Narou (Let's Become a Novelist) using its API, I felt that pulling the update information for bookmarked works from My Page myself would be closer to what I really wanted to do, so I made a version for the novel posting site Hameln.
It is an application that uses BeautifulSoup4 and IFTTT to send Hameln update information to LINE Notify.
This time we will use scraping. Scraping is a technique that is subject to legal restrictions, so please check the legal situation before doing it. The first thing to keep in mind is not to overload the target site's servers. As a countermeasure, this script calls time.sleep(1) after every GET or POST to insert a waiting time.
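As a reference, a minimal sketch of that pattern looks like this (the helper name polite_get is my own and is not used in the script below):
import time
import requests

def polite_get(session, url, wait=1.0):
    # Fetch a page, then pause so consecutive requests do not hammer the server.
    res = session.get(url)
    res.raise_for_status()
    time.sleep(wait)
    return res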
IFTTT is a service that links different services together. This time, connect Webhooks and LINE Notify so that notifications are sent to your LINE account. The setup is done on the IFTTT site: create an applet with Webhooks as the trigger and LINE Notify as the action, and note the applet name and Webhooks key used below.
Let me briefly explain the source code. Imports
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv
import time
post_ifttt() is a function that sends a notification through IFTTT to LINE Notify. The applet name and the Webhooks key are used here. I also used this in the Narou update-notification app.
def post_ifttt(json):
    # json: {"value1": "content"}
    url = (
        "https://maker.ifttt.com/trigger/"
        + "APPLET_NAME"    # Applet Name (replace with your own)
        + "/with/key/"
        + "WEBHOOKS_KEY"   # Webhooks Key (replace with your own)
    )
    requests.post(url, json)
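For reference, a minimal call looks like this, assuming the applet forwards value1 into the LINE message body:
post_ifttt({"value1": "test notification"})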
extract() is the core function of this code and is used in the parts described later. Depending on condition, it extracts the title, the number of episodes, or the URL from the HTML and appends it to the list li. The branching may be a little hard to follow; it might have been better to write the condition checks in parallel. The checks on "<", ">" and the double quote are the if statements that strip the HTML tags and pick out only the desired text or attribute value.
def extract(info, condition, li):
    # Pull the title, latest episode number, or URL out of each tag in info
    # and append it to li, depending on condition ("novel", "latest", "href").
    for item in info:
        if condition in str(item):
            a = ""
            is_a = 0
            if condition != "href":
                # Take the text between ">" and "<" (digits only when condition is "latest").
                for s in str(item):
                    if s == "<" and is_a == 1:
                        is_a = 0
                        li.append(a)
                        break
                    if is_a == 1:
                        if condition == "latest":
                            if "0" <= s and s <= "9":
                                a += s
                        else:
                            a += s
                    if s == ">" and is_a == 0:
                        is_a = 1
            else:
                # Take the value of the first quoted attribute (the href), skipping user-page links.
                if "mode=user" in str(item):
                    continue
                for s in str(item):
                    if s == "\"" and is_a == 1:
                        is_a = 0
                        li.append(a)
                        break
                    if is_a == 1:
                        a += s
                    if s == "\"" and is_a == 0:
                        is_a = 1
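As a quick sanity check, extract() behaves like this on a made-up anchor tag (hypothetical markup, not Hameln's actual HTML):
sample = BeautifulSoup('<a href="https://syosetu.org/novel/12345/">Sample Title</a>', "html.parser").select("a")
demo_titles, demo_urls = [], []
extract(sample, "novel", demo_titles)  # text between ">" and "<"      -> ["Sample Title"]
extract(sample, "href", demo_urls)     # first quoted attribute value  -> ["https://syosetu.org/novel/12345/"]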
Login. Since the scraping starts from Hameln's My Page, POST the necessary information to the login screen to log in. The information required for logging in differs from site to site and can be checked with each site's developer tools; for Hameln it is id, pass, and mode. The mode is "login_entry_end" for everyone. POST this information to log in. The detailed usage of BeautifulSoup is summarized in the article below, so please have a look.
##############################################################
# Log in #
##############################################################
# id, pass
with open("input.txt") as f:
"""
input.txt: [ID PASS]
"""
s = f.read().split()
ID = s[0]
PASS = s[1]
session = requests.session()
url_login = "https://syosetu.org/?mode=login"
response = session.get(url_login)
time.sleep(1)
login_info = {
    "id": ID,
    "pass": PASS,
    "mode": "login_entry_end"
}
res = session.post(url_login, data=login_info)
res.raise_for_status() # for error
time.sleep(1)
By the way, input.txt is an input file in which the ID and the password are saved in that order, separated by a single half-width space. Example:
input.txt
ID_hoge passwd_hoge
User name output. The user name is extracted from the HTML of the user information page. This part is easy.
###############################################################
# Print User Name #
###############################################################
soup_myage = BeautifulSoup(res.text, "html.parser")
account_href = soup_myage.select_one(".spotlight li a").attrs["href"]
url_account = urljoin(url_login, account_href)
res_account = session.get(url_account)
res_account.raise_for_status()
time.sleep(1)
soup_account = BeautifulSoup(res_account.text, "html.parser")
# Strip the surrounding <h3> tags and take the part before "/" as the user name
user_name = str((soup_account.select(".section3 h3"))[0])[4:-5].split("/")[0]
print("Hello "+ user_name + "!")
Get information about your favorite novels from each favorites page. The favorites list spans multiple pages, so from each page the title, the number of episodes, and the URL are stored in the lists titles, latest_no, and ncode, respectively. Updates are checked later and the results are saved to a file.
###############################################################
# Page Transition #
###############################################################
a_list = soup_myage.select(".section.pickup a")
favo_a = ""
for _ in a_list:
    if "To favorite list" in _:
        favo_a = _
        break
url_favo = urljoin(url_login, favo_a.attrs["href"])
res_favo = session.get(url_favo)
res_favo.raise_for_status()
time.sleep(1)
soup_favo = BeautifulSoup(res_favo.text, "html.parser")
bookmark_titles = soup_favo.select(".section3 h3 a")
bookmark_latest = soup_favo.select(".section3 p a")
titles = []
latest_no = []
ncode = []
extract(bookmark_titles, "novel", titles)
extract(bookmark_latest, "latest", latest_no)
extract(bookmark_titles, "href", ncode)
###############################################################
# Start Page Transition #
###############################################################
number_of_bookmarks_h2 = soup_favo.select_one(".heading h2")
number_of_bookmarks = ""
for s in str(number_of_bookmarks_h2)[4:-5]:
    if s >= "0" and s <= "9":
        number_of_bookmarks += s
number_of_bookmarks = int(number_of_bookmarks)
number_of_favo_pages = number_of_bookmarks // 10 + 1  # 10 bookmarks per favorites page
for i in range(2, number_of_favo_pages + 1):
    url_favo = "https://syosetu.org/?mode=favo&word=&gensaku=&type=&page=" + str(i)
    res_favo = session.get(url_favo)
    res_favo.raise_for_status()
    soup_favo = BeautifulSoup(res_favo.text, "html.parser")
    bookmark_titles = soup_favo.select(".section3 h3 a")
    bookmark_latest = soup_favo.select(".section3 p a")
    extract(bookmark_titles, "novel", titles)
    extract(bookmark_latest, "latest", latest_no)
    extract(bookmark_titles, "href", ncode)
    time.sleep(1)
Data acquisition. The newly acquired information is stored in bookmark_info, and the previously acquired information is read into data. The two are then compared to check whether each work has been updated.
###############################################################
# Get Latest Data #
###############################################################
bookmark_info = []
for i in range(len(titles)):
    bookmark_info.append([titles[i], latest_no[i], ncode[i]])
###############################################################
# Get Previous Data #
###############################################################
read_file = "hameln.csv"
with open(read_file, encoding="utf-8") as f:
    reader = csv.reader(f)
    data = [row for row in reader]
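Note that hameln.csv does not exist yet on the very first run; one possible guard, as a sketch (the os.path.exists check is my addition, not in the original script):
import os

if os.path.exists(read_file):
    with open(read_file, encoding="utf-8") as f:
        data = [row for row in csv.reader(f)]
else:
    data = []  # nothing to compare against on the first run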
###############################################################
# Check Whether Novels are Updated #
###############################################################
"""
previous data: data
latest data: bookmark_info
"""
for prev in data:
    for latest in bookmark_info:
        if prev[0] == latest[0]:
            # check whether the episode count has changed
            if prev[1] != latest[1]:
                print(str(latest[0]) + " has been updated.\n" + latest[2])
                json = {"value1": str(latest[0]) + " has been updated.\n" + latest[2]}
                post_ifttt(json)
Write the update information to a file.
###############################################################
# Write Latest Information #
###############################################################
output = "hameln.csv"
with open(output, mode='w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for i in range(len(bookmark_info)):
        writer.writerow([bookmark_info[i][0], bookmark_info[i][1], bookmark_info[i][2]])
GitHub. I uploaded the code to GitHub (here). Please take a look if you like.
Of the things I learned from making this app, the login process was the most interesting: you are not just sending your ID and password. I also automated the script with Task Scheduler; for details on how to use Task Scheduler, please see the references section.