[PYTHON] I tried sending update notifications for "Hameln" using "Beautiful Soup" and "IFTTT"

Introduction

When I made an update notification app for "Let's Become a Novelist" (Shōsetsuka ni Narō) using its API, I thought that getting update information for the works on My Page would let me get even more absorbed in what I want to do, so this time I made a version for the novel posting site called Hameln.

What does it do?

It is an application that notifies LINE Notify of Hameln update information using BeautifulSoup4 and IFTTT.

Environment

Before preparing: a note on the difference between using an API and scraping

This time we will use scraping. Scraping is a technique that can run into legal restrictions, so be sure to check the legal side first. The most important point is not to overload the site's servers. As a countermeasure, this app calls time.sleep(1) after every GET or POST to create a waiting time.
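
As a minimal sketch of that pattern (the helper name is my own, not something from the script below), every request can go through a small function that waits after each call:

import time
import requests

def polite_get(session, url, wait=1.0):
    # Hypothetical helper: fetch a page, raise on HTTP errors, then wait
    # so that we do not hammer the server with rapid-fire requests.
    res = session.get(url)
    res.raise_for_status()
    time.sleep(wait)
    return res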

Preparation (Create an Applet with IFTTT)

IFTTT is a service that links different services together. This time, we connect Webhooks and LINE Notify and have them send notifications to your LINE. Procedure:

  1. Register with IFTTT
  2. Click Create at the top right of the screen. It changes to a screen that says "if + This Then That". See the figure below. ifttt_ifttt.PNG
  3. Click + This. Type Webhooks into the search bar and select it.
  4. Click the card that says "Receive a web request". On the screen shown below, enter a name of your choice as the "Event Name"; it will be used later. ifttt_2.PNG
  5. Click + That. Select LINE and click the "Send message" field.
  6. Log in to LINE, set the content of the message to "Value1: Value1" (it is also fine to leave it as is) and click "Create Action". ifttt_3.PNG
  7. Check the contents and click Finish.
  8. Click Explore, type Webhooks in the search window and select the Services tab. Click Webhooks. You may also be able to get there from this link.
  9. Click Documentation and you should see "Your key is: ~~~~~"; make a note of the key, as it will be used later. That completes the IFTTT setup (a quick test is sketched just below).
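
Before writing any scraping code, you can check that the Applet works by firing the webhook once by hand. The following is only a rough sketch; your_event_name and your_webhooks_key are placeholders for the values from steps 4 and 9:

import requests

EVENT_NAME = "your_event_name"      # Event Name set in step 4 (fill in your own)
WEBHOOKS_KEY = "your_webhooks_key"  # key noted in step 9 (fill in your own)

url = "https://maker.ifttt.com/trigger/" + EVENT_NAME + "/with/key/" + WEBHOOKS_KEY
requests.post(url, json={"value1": "IFTTT test notification"})

If the Applet is set up correctly, a message containing "IFTTT test notification" should arrive on LINE.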

Source code description

Here is a brief explanation of the source code.

import

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv
import time

post_ifttt()

This function sends a notification to LINE Notify via IFTTT. The Applet (event) name and Webhooks key are used here. I also used the same function in the update notification app for "Let's Become a Novelist".

EVENT_NAME = "your_event_name"      # the Applet's Event Name from the IFTTT setup (fill in your own)
WEBHOOKS_KEY = "your_webhooks_key"  # the Webhooks key from the Documentation page (fill in your own)

def post_ifttt(json):
    # json: {"value1": "content"}
    url = (
        "https://maker.ifttt.com/trigger/"
        + EVENT_NAME
        + "/with/key/"
        + WEBHOOKS_KEY
    )
    requests.post(url, json)

extract()

This is the workhorse function of this code, used in the parts described later. Depending on the condition, it extracts one of ["Title"], ["Number of stories"], or ["URL"] from the HTML and stores it in the given list. The branching may be a little hard to follow; it might have been better to write the condition checks as parallel if statements. The checks on "<", ">", and "\"" strip the HTML tags and pick out only the desired text or attribute.

def extract(info, condition, li):
    for item in info:
        if condition in str(item):
            a = ""
            is_a = 0
            if condition!="href":
                for s in str(item):
                    if s=="<" and is_a==1:
                        is_a = 0
                        li.append(a)
                        break
                    if is_a==1:
                        if condition=="latest":
                            if "0" <= s and s <= "9":
                                a+=s
                        else:
                            a += s
                    if s==">" and is_a==0:
                        is_a = 1
            else:
                if "mode=user" in str(item):
                    continue
                for s in str(item):
                    if s=="\"" and is_a==1:
                        is_a = 0
                        li.append(a)
                        break
                    if is_a==1:
                        a += s
                    if s=="\"" and is_a==0:
                        is_a = 1

Login

Since we scrape from Hameln's My Page, we first POST the necessary information to the login screen and log in. The information required for the login process differs from site to site and can be checked with each site's developer tools; for Hameln it is "id", "pass", and "mode". The mode is "login_entry_end" for everyone. POST this information to log in. The detailed usage of Beautiful Soup is summarized in the article below, so please have a look.

##############################################################
#                           Log in                           #
##############################################################
# id, pass
with open("input.txt") as f:
    """
    input.txt: [ID PASS]
    """
    s = f.read().split()
    ID = s[0]
    PASS = s[1]

session = requests.session()

url_login = "https://syosetu.org/?mode=login"
response = session.get(url_login)
time.sleep(1)

login_info = {
    "id":ID,
    "pass":PASS,
    "mode":"login_entry_end"
}

res = session.post(url_login, data=login_info)
res.raise_for_status() # for error
time.sleep(1)

By the way, input.txt is an input file in which the ID and password are saved in that order, separated by a single half-width space. Example:

input.txt


ID_hoge passwd_hoge 

User name output

The user name is extracted from the HTML of the user information page. Easy.

###############################################################
#                        Print User Name                      #
###############################################################

soup_myage = BeautifulSoup(res.text, "html.parser")

account_href = soup_myage.select_one(".spotlight li a").attrs["href"]
url_account = urljoin(url_login, account_href)

res_account = session.get(url_account)
res_account.raise_for_status()
time.sleep(1)

soup_account = BeautifulSoup(res_account.text, "html.parser")
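# str(<h3> tag)[4:-5] strips the surrounding "<h3>" and "</h3>" tags,
# and split("/")[0] keeps the part before the first slash, i.e. the user name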
user_name = str((soup_account.select(".section3 h3"))[0])[4:-5].split("/")[0]

print("Hello "+ user_name + "!")

Get information about your favorite novels from each favorites page

There are multiple favorites pages. So, from each page, the ["Title"], ["Number of stories"], and ["URL"] are stored in the lists titles, latest_no, and ncode, respectively. Updates are checked later and the results are saved to a file.

###############################################################
#                        Page Transition                      #
###############################################################
a_list = soup_myage.select(".section.pickup a")
favo_a = ""
for _ in a_list:
    if("To favorite list" in _):
        favo_a = _
        break

url_favo = urljoin(url_login, favo_a.attrs["href"])

res_favo = session.get(url_favo)
res_favo.raise_for_status()
time.sleep(1)

soup_favo = BeautifulSoup(res_favo.text, "html.parser")
bookmark_titles = soup_favo.select(".section3 h3 a")
bookmark_latest = soup_favo.select(".section3 p a")
titles = []
latest_no = []
ncode = []

extract(bookmark_titles, "novel", titles)
extract(bookmark_latest, "latest", latest_no)
extract(bookmark_titles, "href", ncode)
###############################################################
#                     Start Page Transition                   #
###############################################################
number_of_bookmarks_h2 = soup_favo.select_one(".heading h2")
number_of_bookmarks = ""
for s in str(number_of_bookmarks_h2)[4:-5]:
    if s>="0" and s<='9':
        number_of_bookmarks += s
number_of_bookmarks = int(number_of_bookmarks)
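# assuming 10 bookmarks are shown per favorites page, work out how many pages to visit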
number_of_favo_pages = number_of_bookmarks // 10 + 1

for i in range(2,number_of_favo_pages+1):
    url_favo = "https://syosetu.org/?mode=favo&word=&gensaku=&type=&page=" + str(i)
    res_favo = session.get(url_favo)
    res_favo.raise_for_status()
    soup_favo = BeautifulSoup(res_favo.text, "html.parser")
    bookmark_titles = soup_favo.select(".section3 h3 a")
    bookmark_latest = soup_favo.select(".section3 p a")
    extract(bookmark_titles, "novel", titles)
    extract(bookmark_latest, "latest", latest_no)
    extract(bookmark_titles, "href", ncode)
    time.sleep(1)

Data acquisition

The newly acquired information is stored in bookmark_info, and the previously saved information is read into data. Then we check whether each novel has been updated.

###############################################################
#                        Get Latest Data                      #
###############################################################
bookmark_info = []
for i in range(len(titles)):
    bookmark_info.append([titles[i], latest_no[i], ncode[i]])

###############################################################
#                       Get Previous Data                     #
###############################################################
read_file = "hameln.csv"
with open(read_file, encoding="utf-8") as f:
    reader = csv.reader(f)
    data = [row for row in reader]

###############################################################
#              Check Whether Novels are Updated               #
###############################################################
"""
previous data: data
latest data: bookmark_info
"""
for prev in data:
    for latest in bookmark_info:
        if prev[0] == latest[0]:
            # check
            if prev[1] != latest[1]:
                print(str(latest[0]) + " has been updated.\n" + latest[2])
                json = {"value1": str(latest[0]) + " has been updated.\n" + latest[2]}
                post_ifttt(json)
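
One caveat about the "Get Previous Data" step above: on the very first run, hameln.csv does not exist yet, so open() raises FileNotFoundError. A minimal guard, which is my own addition rather than part of the original script, could look like this:

try:
    with open(read_file, encoding="utf-8") as f:
        data = [row for row in csv.reader(f)]
except FileNotFoundError:
    data = []  # nothing to compare against on the first run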

Write update information to a file

###############################################################
#                    Write Latest Information                 #
###############################################################
output = "hameln.csv"
with open(output, mode='w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for i in range(len(bookmark_info)):
        writer.writerow([bookmark_info[i][0], bookmark_info[i][1], bookmark_info[i][2]])

GitHub

I uploaded the code to GitHub (here). Please take a look if you like.

In closing

The login process was the most interesting piece of knowledge gained from making this app; you are not just passing in an ID and password. I also automated the script with Task Scheduler. For more information on how to use Task Scheduler, please see the references section.

References
