[PYTHON] Scraping the schedule of Hinatazaka46 and reflecting it in Google Calendar

Conclusion

If you are not interested in the code and just want to add the calendar, you can get it here. If you have a Google Account, you can add it immediately.

In this article, we aim to scrape information from the official website and automatically generate the following calendar.

[Screenshot of the generated Google Calendar, 2020-10-12]

This has the following merits:

- **Turn on Google Calendar notifications and you will never miss their activities**
- **Since you know the activity schedule in advance, you can reduce the risk of missing something because of a conflicting appointment**

Background

Hinatazaka46 is one of the Sakamichi groups, and its motto is "Happy Aura". Naturally, many people are attracted to their **"visuals"**, **"brightness"**, and **"attitude of working hard at anything"**, and I am one of them.

The most reliable way to follow their activities is to check the "Schedule" page on the official website. I often look at it myself.

However, I was personally dissatisfied with the following points:

- **It is inefficient to see which activity is on which day** (you cannot tell at a glance because you have to scroll to the day in question)
- **Entries are not necessarily in chronological order** (an "18:00~" entry may be listed after a "22:00~" entry)

Also, subscribing to the calendar from this site covers the major events, but it did not seem to cover the finer-grained ones (irregular activities that are not on a fixed schedule).

So, to eliminate these frustrations, I decided to **"reflect their schedule in my own Google Calendar"**.

Implementation

Version

Preparation

You need to set up the Google Calendar API (enable it and obtain credentials). For the procedure, please refer to this article, which explains it clearly.

Also, if you want to run the script periodically, it is better to use cron or Heroku. I personally use Heroku because the script does not have to run on my local PC. I explained how to use Heroku on my Hatena blog before, so please refer to that if you like.

Procedure

  1. Scrape the necessary information from the Schedule page on the official website
  2. Reflect the information in Google Calendar

① Scraping the necessary information from the official website

Function to get event information

The following four pieces of information are acquired.

- Category
- Time
- Event name
- Appearing members

Since there may be multiple events on the same day, the information is acquired in the following flow.

  1. Get all the events for each date (search_event_each_date)
  2. Get the events for a specific day (search_event_info)
  3. Get detailed information about one event (search_detail_info)

def search_event_each_date(year, month):
    # NOTE: Zero-pad the month; the dy parameter is assumed to take YYYYMM
    url = (
        f"https://www.hinatazaka46.com/s/official/media/list?ima=0000&dy={year}{int(month):02d}"
    )
    result = requests.get(url)
    soup = BeautifulSoup(result.content, features="lxml")
    events_each_date = soup.find_all("div", {"class": "p-schedule__list-group"})

    time.sleep(3)  # NOTE: Reduce the load on the server

    return events_each_date


def search_event_info(event_each_date):
    event_date_text = remove_blank(event_each_date.contents[1].text)[
        :-1
    ]  # NOTE: Drop the trailing day-of-week character and keep only the day
    events_time = event_each_date.find_all("div", {"class": "c-schedule__time--list"})
    events_name = event_each_date.find_all("p", {"class": "c-schedule__text"})
    events_category = event_each_date.find_all("div", {"class": "p-schedule__head"},)
    events_link = event_each_date.find_all("li", {"class": "p-schedule__item"})

    return event_date_text, events_time, events_name, events_category, events_link


def search_detail_info(event_name, event_category, event_time, event_link):
    event_name_text = remove_blank(event_name.text)
    event_category_text = remove_blank(event_category.contents[1].text)
    event_time_text = remove_blank(event_time.text)
    event_link = event_link.find("a")["href"]
    active_members = search_active_member(event_link)

    return event_name_text, event_category_text, event_time_text, active_members


def search_active_member(link):
    try:
        url = f"https://www.hinatazaka46.com{link}"
        result = requests.get(url)
        soup = BeautifulSoup(result.content, features="lxml")
        active_members = soup.find("div", {"class": "c-article__tag"}).text
        time.sleep(3)  # NOTE: Reduce the load on the server
    except AttributeError:
        active_members = ""

    return active_members

def remove_blank(text):
    text = text.replace("\n", "")
    text = text.replace(" ", "")
    return text

**[Addition]** The version as of 2020/10/14 could not correctly acquire events other than media-related ones, so I modified it as follows. (The code above already reflects this change.)

(Before correction)

events_category = event_each_date.find_all(
     "div", {"class": "c-schedule__category category_media"}
)

event_category_text = remove_blank(event_category.text)

(Revised)

events_category = event_each_date.find_all("div", {"class": "p-schedule__head"},)

event_category_text = remove_blank(event_category.contents[1].text)

Now events like "Birthday" and "LIVE" can be correctly reflected in the calendar.

Functions related to time

Time needs special care because, depending on the notation, there are cases such as the following, so I prepared functions to handle them:

- The time runs into the next day, like "24:20~25:00"
- There is only date information in the first place

def over24Hdatetime(year, month, day, times):
    """Convert a time past 24:00 (e.g. 25:00) to a datetime string."""
    hour, minute = times.split(":")[:-1]

    # to minute
    minutes = int(hour) * 60 + int(minute)

    dt = datetime.datetime(year=int(year), month=int(month), day=int(day))
    dt += datetime.timedelta(minutes=minutes)

    return dt.strftime("%Y-%m-%dT%H:%M:%S")


def prepare_info_for_calendar(
    event_name_text, event_category_text, event_time_text, active_members
):
    event_title = f"({event_category_text}){event_name_text}"
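    # NOTE: year, month and event_date_text are globals set in the main loop
    if event_time_text == "":
        # Zero-pad the month so the all-day date is a valid YYYY-MM-DD string
        event_start = f"{year}-{int(month):02d}-{event_date_text}"
        event_end = f"{year}-{int(month):02d}-{event_date_text}"
        is_date = True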
    else:
        start, end = search_start_and_end_time(event_time_text)
        event_start = over24Hdatetime(year, month, event_date_text, start)
        event_end = over24Hdatetime(year, month, event_date_text, end)
        is_date = False
    return event_title, event_start, event_end, is_date
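
As a worked example of the two cases described above, the following minimal sketch splits a schedule time string and converts over-24-hour notation into a timestamp. It assumes the text has already had whitespace removed and uses an ASCII "~" separator, as in the code in this article; `parse_time_range` and `to_iso` are hypothetical helper names, not part of the original script.

```python
import datetime


def parse_time_range(time_text):
    """Split "HH:MM~HH:MM" or "HH:MM~" into (start, end); None if date-only."""
    if time_text == "":
        return None  # date-only event: no time information at all
    start, _, end = time_text.partition("~")
    return start, (end or start)  # an open-ended range reuses the start time


def to_iso(year, month, day, hhmm):
    """Convert a possibly over-24h "HH:MM" on the given date to a timestamp."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    base = datetime.datetime(int(year), int(month), int(day))
    # timedelta carries hours >= 24 over into the following day
    moment = base + datetime.timedelta(hours=hour, minutes=minute)
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


start, end = parse_time_range("24:20~25:00")
print(to_iso(2020, 10, 12, start))  # 2020-10-13T00:20:00
print(to_iso(2020, 10, 12, end))    # 2020-10-13T01:00:00
```

Letting `timedelta` do the carry-over also handles month boundaries, so a "25:00" entry on the 31st lands on the 1st of the next month.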

② Reflect information in Google Calendar

The general procedure is as follows.

  1. Build a service object for the Calendar API
  2. Determine whether the event was added previously
  3. Add the event

API settings

import os.path
import pickle

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

def build_calendar_api():
    SCOPES = ["https://www.googleapis.com/auth/calendar"]
    creds = None
    if os.path.exists("token.pickle"):
        with open("token.pickle", "rb") as token:
            creds = pickle.load(token)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
            creds = flow.run_local_server(port=0)
        with open("token.pickle", "wb") as token:
            pickle.dump(creds, token)

    service = build("calendar", "v3", credentials=creds)

    return service

Determining if it is a previously added event

Before adding an event, check whether it was added previously, using a key of the form "event name-start time". The search_events function gets the list of existing keys for that purpose.

def search_events(service, calendar_id, start):

    end_datetime = datetime.datetime.strptime(start, "%Y-%m-%d") + relativedelta(
        months=1
    )
    end = end_datetime.strftime("%Y-%m-%d")

    events_result = (
        service.events()
        .list(
            calendarId=calendar_id,
            timeMin=start + "T00:00:00+09:00",  # NOTE: The +09:00 offset is important (JST rather than UTC)
            timeMax=end + "T23:59:00+09:00",  # NOTE: Search up to one month ahead
        )
        .execute()
    )
    events = events_result.get("items", [])

    if not events:
        return []
    else:
        events_starttime = change_event_starttime_to_jst(events)
        return [
            event["summary"] + "-" + event_starttime
            for event, event_starttime in zip(events, events_starttime)
        ]

def change_event_starttime_to_jst(events):
    events_starttime = []
    for event in events:
        if "date" in event["start"].keys():
            events_starttime.append(event["start"]["date"])
        else:
            str_event_uct_time = event["start"]["dateTime"]
            event_jst_time = datetime.datetime.strptime(
                str_event_uct_time, "%Y-%m-%dT%H:%M:%S+09:00"
            )
            str_event_jst_time = event_jst_time.strftime("%Y-%m-%dT%H:%M:%S")
            events_starttime.append(str_event_jst_time)
    return events_starttime
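
The duplicate check can also be sketched without hard-coding the "+09:00" suffix. The following is a minimal sketch, not the method used in this article: it assumes event dicts shaped like the Calendar API response (a `start` object holding either `date` or `dateTime`) and uses `datetime.fromisoformat` to normalize any UTC offset to JST. `event_key` and the sample data are hypothetical. Note that before Python 3.11, `fromisoformat` does not accept a trailing "Z".

```python
import datetime

JST = datetime.timezone(datetime.timedelta(hours=9))


def event_key(event):
    """Build a "summary-start" key from a Calendar-API-style event dict."""
    start = event["start"]
    if "date" in start:  # all-day event: the date string is used as is
        return event["summary"] + "-" + start["date"]
    # Timed event: parse the offset-aware timestamp and express it in JST
    moment = datetime.datetime.fromisoformat(start["dateTime"]).astimezone(JST)
    stamp = moment.strftime("%Y-%m-%dT%H:%M:%S")
    return event["summary"] + "-" + stamp


# Hypothetical sample data in the shape of an events list response
existing = [
    {"summary": "(LIVE)Example", "start": {"date": "2020-10-12"}},
    {"summary": "(TV)Example", "start": {"dateTime": "2020-10-12T15:20:00+00:00"}},
]
existing_keys = {event_key(e) for e in existing}

print("(TV)Example-2020-10-13T00:20:00" in existing_keys)  # True
```

Keeping the keys in a set makes each membership test cheap, which matters little at this scale but keeps the intent ("have I seen this key?") explicit.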

Add event

def add_date_schedule(
    event_name, event_category, event_time, event_link, previous_add_event_lists
):
    (
        event_name_text,
        event_category_text,
        event_time_text,
        active_members,
    ) = search_detail_info(event_name, event_category, event_time, event_link)

    # Prepare the information to be reflected in the calendar
    (event_title, event_start, event_end, is_date,) = prepare_info_for_calendar(
        event_name_text, event_category_text, event_time_text, active_members,
    )

    # NOTE: Skip the event if the same appointment already exists
    if f"{event_title}-{event_start}" not in previous_add_event_lists:
        add_info_to_calendar(
            calendarId, event_title, event_start, event_end, active_members, is_date,
        )


def add_info_to_calendar(calendarId, summary, start, end, active_members, is_date):

    if is_date:
        event = {
            "summary": summary,
            "description": active_members,
            "start": {"date": start, "timeZone": "Japan",},
            "end": {"date": end, "timeZone": "Japan",},
        }
    else:
        event = {
            "summary": summary,
            "description": active_members,
            "start": {"dateTime": start, "timeZone": "Japan",},
            "end": {"dateTime": end, "timeZone": "Japan",},
        }

    event = service.events().insert(calendarId=calendarId, body=event,).execute()
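
The two event dictionaries differ only in whether they carry `date` or `dateTime`, so the body can be built by one pure function that is easy to test without calling the API. The sketch below is one possible shape under stated assumptions: `build_event_body` is a hypothetical helper, "Asia/Tokyo" is used as the time zone name, and the end date of an all-day event is treated as exclusive (the API reference describes `end` as exclusive), which differs from the code above where start and end are the same date. Verify that behavior against your own calendar before relying on it.

```python
import datetime


def build_event_body(summary, description, start, end=None, all_day=False):
    """Return an event body dict for events().insert() (hypothetical helper)."""
    if all_day:
        first_day = datetime.date.fromisoformat(start)
        # Assumption: the end date is exclusive, so a one-day event ends the next day
        last_day = first_day + datetime.timedelta(days=1)
        start_field = {"date": first_day.isoformat()}
        end_field = {"date": last_day.isoformat()}
    else:
        start_field = {"dateTime": start, "timeZone": "Asia/Tokyo"}
        end_field = {"dateTime": end or start, "timeZone": "Asia/Tokyo"}
    return {
        "summary": summary,
        "description": description,
        "start": start_field,
        "end": end_field,
    }


body = build_event_body("(Birthday)Example", "", "2020-10-31", all_day=True)
print(body["start"], body["end"])
# {'date': '2020-10-31'} {'date': '2020-11-01'}
```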

Full text

The full script reflects three months of schedule, starting with the current month, in Google Calendar. The only setting you need to change is calendarId, which should be the ID of your own calendar.


import time
import pickle
import os.path

import requests
from bs4 import BeautifulSoup

import datetime
from dateutil.relativedelta import relativedelta

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request


def build_calendar_api():
    SCOPES = ["https://www.googleapis.com/auth/calendar"]
    creds = None
    if os.path.exists("token.pickle"):
        with open("token.pickle", "rb") as token:
            creds = pickle.load(token)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
            creds = flow.run_local_server(port=0)
        with open("token.pickle", "wb") as token:
            pickle.dump(creds, token)

    service = build("calendar", "v3", credentials=creds)

    return service


def remove_blank(text):
    text = text.replace("\n", "")
    text = text.replace(" ", "")
    return text


def search_event_each_date(year, month):
    # NOTE: Zero-pad the month; the dy parameter is assumed to take YYYYMM
    url = (
        f"https://www.hinatazaka46.com/s/official/media/list?ima=0000&dy={year}{int(month):02d}"
    )
    result = requests.get(url)
    soup = BeautifulSoup(result.content, features="lxml")
    events_each_date = soup.find_all("div", {"class": "p-schedule__list-group"})

    time.sleep(3)  # NOTE: Reduce the load on the server

    return events_each_date


def search_start_and_end_time(event_time_text):
    has_end = event_time_text[-1] != "~"
    if has_end:
        start, end = event_time_text.split("~")
    else:
        start = event_time_text.split("~")[0]
        end = start
    start += ":00"
    end += ":00"
    return start, end


def search_event_info(event_each_date):
    event_date_text = remove_blank(event_each_date.contents[1].text)[
        :-1
    ]  # NOTE: Drop the trailing day-of-week character and keep only the day
    events_time = event_each_date.find_all("div", {"class": "c-schedule__time--list"})
    events_name = event_each_date.find_all("p", {"class": "c-schedule__text"})
    events_category = event_each_date.find_all("div", {"class": "p-schedule__head"},)
    events_link = event_each_date.find_all("li", {"class": "p-schedule__item"})

    return event_date_text, events_time, events_name, events_category, events_link


def search_detail_info(event_name, event_category, event_time, event_link):
    event_name_text = remove_blank(event_name.text)
    event_category_text = remove_blank(event_category.contents[1].text)
    event_time_text = remove_blank(event_time.text)
    event_link = event_link.find("a")["href"]
    active_members = search_active_member(event_link)

    return event_name_text, event_category_text, event_time_text, active_members

def search_active_member(link):
    try:
        url = f"https://www.hinatazaka46.com{link}"
        result = requests.get(url)
        soup = BeautifulSoup(result.content, features="lxml")
        active_members = soup.find("div", {"class": "c-article__tag"}).text
        time.sleep(3)  # NOTE: Reduce the load on the server
    except AttributeError:
        active_members = ""

    return active_members


def over24Hdatetime(year, month, day, times):
    """Convert a time past 24:00 (e.g. 25:00) to a datetime string."""
    hour, minute = times.split(":")[:-1]

    # to minute
    minutes = int(hour) * 60 + int(minute)

    dt = datetime.datetime(year=int(year), month=int(month), day=int(day))
    dt += datetime.timedelta(minutes=minutes)

    return dt.strftime("%Y-%m-%dT%H:%M:%S")


def prepare_info_for_calendar(
    event_name_text, event_category_text, event_time_text, active_members
):
    event_title = f"({event_category_text}){event_name_text}"
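    # NOTE: year, month and event_date_text are globals set in the main loop
    if event_time_text == "":
        # Zero-pad the month so the all-day date is a valid YYYY-MM-DD string
        event_start = f"{year}-{int(month):02d}-{event_date_text}"
        event_end = f"{year}-{int(month):02d}-{event_date_text}"
        is_date = True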
    else:
        start, end = search_start_and_end_time(event_time_text)
        event_start = over24Hdatetime(year, month, event_date_text, start)
        event_end = over24Hdatetime(year, month, event_date_text, end)
        is_date = False
    return event_title, event_start, event_end, is_date


def change_event_starttime_to_jst(events):
    events_starttime = []
    for event in events:
        if "date" in event["start"].keys():
            events_starttime.append(event["start"]["date"])
        else:
            str_event_uct_time = event["start"]["dateTime"]
            event_jst_time = datetime.datetime.strptime(
                str_event_uct_time, "%Y-%m-%dT%H:%M:%S+09:00"
            )
            str_event_jst_time = event_jst_time.strftime("%Y-%m-%dT%H:%M:%S")
            events_starttime.append(str_event_jst_time)
    return events_starttime


def search_events(service, calendar_id, start):

    end_datetime = datetime.datetime.strptime(start, "%Y-%m-%d") + relativedelta(
        months=1
    )
    end = end_datetime.strftime("%Y-%m-%d")

    events_result = (
        service.events()
        .list(
            calendarId=calendar_id,
            timeMin=start + "T00:00:00+09:00",  # NOTE: The +09:00 offset is important (JST rather than UTC)
            timeMax=end + "T23:59:00+09:00",  # NOTE: Search up to one month ahead
        )
        .execute()
    )
    events = events_result.get("items", [])

    if not events:
        return []
    else:
        events_starttime = change_event_starttime_to_jst(events)
        return [
            event["summary"] + "-" + event_starttime
            for event, event_starttime in zip(events, events_starttime)
        ]


def add_date_schedule(
    event_name, event_category, event_time, event_link, previous_add_event_lists
):
    (
        event_name_text,
        event_category_text,
        event_time_text,
        active_members,
    ) = search_detail_info(event_name, event_category, event_time, event_link)

    # Prepare the information to be reflected in the calendar
    (event_title, event_start, event_end, is_date,) = prepare_info_for_calendar(
        event_name_text, event_category_text, event_time_text, active_members,
    )

    # NOTE: Skip the event if the same appointment already exists
    if f"{event_title}-{event_start}" not in previous_add_event_lists:
        add_info_to_calendar(
            calendarId, event_title, event_start, event_end, active_members, is_date,
        )


def add_info_to_calendar(calendarId, summary, start, end, active_members, is_date):

    if is_date:
        event = {
            "summary": summary,
            "description": active_members,
            "start": {"date": start, "timeZone": "Japan",},
            "end": {"date": end, "timeZone": "Japan",},
        }
    else:
        event = {
            "summary": summary,
            "description": active_members,
            "start": {"dateTime": start, "timeZone": "Japan",},
            "end": {"dateTime": end, "timeZone": "Japan",},
        }

    event = service.events().insert(calendarId=calendarId, body=event,).execute()


if __name__ == "__main__":

    # -------------------------step1: Various settings-------------------------
    # API setup
    calendarId = (
        "〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜"  # NOTE:My calendar ID
    )
    service = build_calendar_api()

    # Search range
    num_search_month = 3  # NOTE: Number of months to reflect, starting with the current month
    current_search_date = datetime.datetime.now()
    year = current_search_date.year
    month = current_search_date.month

    # -------------------------step2: Get information for each date-------------------------
    for _ in range(num_search_month):
        events_each_date = search_event_each_date(year, month)
        for event_each_date in events_each_date:

            # step3: Get the schedules for a specific day at once
            (
                event_date_text,
                events_time,
                events_name,
                events_category,
                events_link,
            ) = search_event_info(event_each_date)

            event_date_text = "{:0=2}".format(
                int(event_date_text)
            )  # NOTE: Zero-pad to 2 digits (e.g. 1 -> 01)
            # NOTE: Zero-pad the month as well so the date is a valid YYYY-MM-DD
            start = f"{year}-{int(month):02d}-{event_date_text}"
            previous_add_event_lists = search_events(service, calendarId, start)

            # step4: Add information to the calendar
            for event_name, event_category, event_time, event_link in zip(
                events_name, events_category, events_time, events_link
            ):
                add_date_schedule(
                    event_name,
                    event_category,
                    event_time,
                    event_link,
                    previous_add_event_lists,
                )

        # step5: Move on to the next month
        current_search_date = current_search_date + relativedelta(months=1)
        year = current_search_date.year
        month = current_search_date.month
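
The month-by-month loop above can be illustrated in isolation. The following minimal sketch generates zero-padded year-month labels for consecutive months using only the standard library; it assumes the `dy` query parameter takes a six-digit YYYYMM value, and `month_labels` is a hypothetical helper rather than part of the script.

```python
import datetime


def month_labels(start, count):
    """Return `count` consecutive "YYYYMM" labels beginning with start's month."""
    labels = []
    year, month = start.year, start.month
    for _ in range(count):
        labels.append(f"{year}{month:02d}")
        # Roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


print(month_labels(datetime.date(2020, 11, 15), 3))
# ['202011', '202012', '202101']
```

Padding the month at the point where the label is built avoids values such as "20211" for January, which would be ambiguous.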



Finally

In this article, I introduced how to reflect the schedule of Hinatazaka46 in Google Calendar. This has the following merits:

- **Turn on Google Calendar notifications and you will never miss their activities**
- **Since you know the activity schedule in advance, you can reduce the risk of missing something because of a conflicting appointment**

This time, we focused on Hinatazaka46, but if you change "① Scraping the necessary information from the official website", you can reuse ② and reflect the schedule of any person or group in Google Calendar.
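
One way to make that reuse concrete is to separate the scraper's output from the calendar step. The sketch below is only an illustration under assumptions: `ScheduledEvent` and `sync_events` are hypothetical names, and the calendar insert call is injected as a plain function so that any scraper producing these records can share the same synchronization logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class ScheduledEvent:
    title: str
    start: str  # "YYYY-MM-DD" or "YYYY-MM-DDTHH:MM:SS"
    end: str
    description: str = ""

    @property
    def key(self) -> str:
        # Same "title-start" convention used for duplicate detection
        return f"{self.title}-{self.start}"


def sync_events(
    scraped: Iterable[ScheduledEvent],
    existing_keys: Set[str],
    insert: Callable[[ScheduledEvent], None],
) -> List[ScheduledEvent]:
    """Insert every scraped event whose key is not already in the calendar."""
    added = []
    for event in scraped:
        if event.key in existing_keys:
            continue  # already reflected in the calendar
        insert(event)
        existing_keys.add(event.key)  # also guards against duplicates in one run
        added.append(event)
    return added


# Usage with a list standing in for the calendar insert call
calendar = []
events = [
    ScheduledEvent("(TV)Example", "2020-10-12T18:00:00", "2020-10-12T19:00:00"),
    ScheduledEvent("(LIVE)Example", "2020-10-13", "2020-10-13"),
]
added = sync_events(events, {"(LIVE)Example-2020-10-13"}, calendar.append)
print([event.title for event in added])  # ['(TV)Example']
```

With this split, switching to another site only means writing a new function that yields `ScheduledEvent` records.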

━━━━━━━━━━

If you don't know Hinatazaka46, why not take this opportunity to get interested? Personally, I recommend **"Let's meet at Hinatazaka"**, broadcast on TV TOKYO every Sunday from 25:05. You will be amazed by, and drawn to, a talent for variety shows that you would not expect from idols. I also think getting to know them through their songs on the Hinatazaka46 OFFICIAL YouTube CHANNEL is a good way in.

Also, as a complete digression, my recent favorite is Konoka Matsuda, who has a very nice smile. What do you think?

[Image: matsudakonoka.png (source: the blog post where the image was published)]

Reference site

How to extract arbitrary events in Google Calendar with Python

Adding an event to Google Calendar in Python

[Python] Get / add Google Calendar appointments using Google Calendar API

About python datetime

━━━━━━━━━━

Hinatazaka46 Home Page

Let's meet at Hinatazaka

Hinatazaka46 OFFICIAL YouTube CHANNEL

Konoka Matsuda's blog
