Overview

I wrote code that scrapes TV program information from the Yahoo! TV.G Guide. Two points needed some ingenuity.

The program information is loaded dynamically after the page is displayed, so it cannot be retrieved with Python's requests alone. I solved this by using selenium.
I developed on both a Mac and a Raspberry Pi, so some processing branches on the OS so that the same code can be shared between both environments.

As a first step, the start time, channel, and program title are written out as CSV, as follows:

Start time,Channel,Program title
21:54,5,Hodo Station
22:30,1,Historical Anecdote Historia "Hirosaki Castle 400 Years of the Northern Castle"
22:50,2,Cat commentary, cats and rice scoops. "Shuichi Yoshida and Kin-chan Gin-chan"
23:00,4,news zero New arrival "mild" man waiting at home(52)Death ... waiting for a bed
23:00,6,NEWS23 Ayaka Ogawa ▽ Mone Kamishiraishi Covers the famous song in telework
23:00,7,WBS ▽ Supermarket is very crowded ... 3 How to prevent it?What is shopping agency? ▽ Iris decides to increase mask production
23:00,8,TOKIO Kakeru [Yosuke Eguchi&Kenichi Takito confesses unexpected couple life!Masterpiece drama also released]
...

Caution

Scraping may be illegal depending on how it is used. Check the Terms of Service and robots.txt each time before you run it. Use the code in this article at your own discretion and at your own risk.
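
As a rough sketch of the robots.txt check, the standard library's urllib.robotparser can evaluate a URL against a set of rules. The helper name is_allowed and the sample rules below are made up for illustration; they are not the actual robots.txt of any site, so fetch and read the real file yourself.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent='*'):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not taken from a real site)
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, 'https://example.com/listings/'))
print(is_allowed(sample_rules, 'https://example.com/private/page'))
```

This only covers robots.txt. The Terms of Service still have to be read by a person.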

Environment

macOS Catalina 10.15.4 with Python 3.8, or Raspberry Pi 3 Model B+ with Raspbian Stretch and Python 3.5

Preparation

Common to Mac and Raspberry Pi

Install beautifulsoup4 and selenium with pip.

pip install beautifulsoup4
pip install selenium

Mac only

If the Chrome browser is not installed on your PC, install it. https://www.google.com/intl/ja_jp/chrome/

Install chromedriver with pip. Choose the version that matches your Chrome browser.

pip install chromedriver-binary==<Chrome version number>

Reference: [For selenium] How to install ChromeDriver with pip (no PATH setup required, version can be specified) https://qiita.com/hanzawak/items/2ab4d2a333d6be6ac760

Raspberry Pi only

Install the chromium driver. (The command below happened to work this time, but honestly I do not fully understand why. It may fail if the browser and driver versions differ, and I have not confirmed what to do in that case.)

sudo apt-get install chromium-chromedriver

Reference: Browser operation on Raspberry Pi with Selenium and chrome driver https://www.miki-ie.com/raspberry-pi/raspberry-pi%E3%81%ABselenium%E3%81%A8chromedriver%E3%81%A7%E3%83%96%E3%83%A9%E3%82%A6%E3%82%B6%E6%93%8D%E4%BD%9C/
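
If webdriver.Chrome() fails on the Raspberry Pi, one thing worth checking first is whether a chromedriver executable is visible on PATH at all. The sketch below uses only the standard library. find_chromedriver is a hypothetical helper name, and the assumption that the apt package exposes an executable called chromedriver is mine and unverified. It does not detect a version mismatch; for that, compare the version reported by `chromedriver --version` with the browser's.

```python
import shutil


def find_chromedriver(names=('chromedriver',)):
    """Return the full path of the first driver name found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


path = find_chromedriver()
if path is None:
    print('chromedriver was not found on PATH')
else:
    print('chromedriver found at', path)
```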

Code

I confirmed that this works as of April 21, 2020, but it may stop working if the structure of the page changes. The code below acquires the program title, start time, and channel. Rewrite the part that extracts information from the HTML as needed for your purpose.
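
The listing URL takes four optional query parameters: a (area), d (date), st (start hour), and va (number of hours). As a sketch, the same URL could also be assembled with urllib.parse.urlencode so that omitted parameters simply drop out. build_listing_url is a hypothetical helper, not part of the script.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tv.yahoo.co.jp/listings/'


def build_listing_url(area=None, date=None, starttime=None, duration_hour=None):
    """Build the listing URL, skipping any parameter left as None."""
    pairs = [('a', area), ('d', date), ('st', starttime), ('va', duration_hour)]
    query = urlencode([(key, value) for key, value in pairs if value is not None])
    return BASE_URL + ('?' + query if query else '')


print(build_listing_url(area='23', date='20200421', starttime='20', duration_hour='6'))
# https://tv.yahoo.co.jp/listings/?a=23&d=20200421&st=20&va=6
```

The full script, which builds the URL by string concatenation instead, follows.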

from bs4 import BeautifulSoup
from selenium import webdriver
import platform

# Branch on the OS so that the same code works on both Mac and Raspberry Pi
OS = platform.system()
if OS == 'Darwin':  # Mac
    import chromedriver_binary  # importing this adds the pip-installed chromedriver to PATH
elif OS == 'Linux':  # Raspberry Pi
    pass  # uses the chromedriver installed with apt

output_file_path = 'program.csv'

area = '23'  # Prefecture whose program guide is displayed. 23 is Tokyo.
date = '20200421'  # Date of the program guide, as YYYYMMDD.
starttime = '20'  # Hour at which the displayed program guide starts.
duration_hour = '6'  # Number of hours of listings to display.

url = 'https://tv.yahoo.co.jp/listings/?'
# Add the area. Can be omitted (commented out). The default value is 23 (Tokyo).
url += ('a=' + area + '&')
# Add the date. Can be omitted (commented out). The default value is the current date.
url += ('d=' + date + '&')
# Add the start hour. Can be omitted (commented out). The default value is the current time.
url += ('st=' + starttime + '&')
# Add the number of hours to display. Can be omitted (commented out). The default value is 6 (hours).
url += ('va=' + duration_hour + '&')

# Get the webdriver. If an error occurs around here, suspect a version mismatch between chromedriver-binary and Chrome.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Load the web page, get the HTML, and parse it with BeautifulSoup
driver.get(url)
html = driver.page_source.encode('utf-8')
driver.quit()  # close the headless browser once the HTML has been captured
soup = BeautifulSoup(html, 'html.parser')

# Acquire the program guide information

# Get the channel list written at the top of the program guide
station_elems = soup.find_all('td', class_='station')
stations = [elem.text.split('ch')[0] for elem in station_elems]
# Get the elements that contain the program titles
title_elems = soup.find_all('a', class_='title')

table = [['Start time', 'Channel', 'Program title']]
for elem in title_elems:
    # Get the title
    title = elem.text
    # Get the start time (named program_start so it does not overwrite the starttime setting above)
    program_start = elem.parent.find('span', class_='time').text
    # Column of the program guide that this program belongs to
    col = int(elem.get('data-ylk').split('pos:')[1])
    # Look up the channel number from the column number (columns start at 1)
    channel = stations[col - 1]
    # Add the start time, channel, and program title as one row
    table.append([program_start, channel, title])

# Save in CSV format (note: titles that contain commas are written unquoted)
with open(output_file_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join([','.join(v) for v in table]))
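
Some titles in the sample output contain commas (the 22:50 row, for example), so joining the fields with ',' produces rows that a CSV reader would split into too many columns. A minimal sketch of one way around this is to let the standard csv module handle the quoting. write_table is a hypothetical helper, and the row is copied from the sample output above.

```python
import csv
import io


def write_table(rows, fileobj):
    """Write rows as CSV, quoting any field that contains commas or quotes."""
    writer = csv.writer(fileobj)
    writer.writerows(rows)


table = [
    ['Start time', 'Channel', 'Program title'],
    ['22:50', '2', 'Cat commentary, cats and rice scoops. "Shuichi Yoshida and Kin-chan Gin-chan"'],
]

# Write to an in-memory buffer here so the result can be inspected directly
buf = io.StringIO()
write_table(table, buf)
print(buf.getvalue())

# Reading the text back yields the original three columns per row
print(list(csv.reader(io.StringIO(buf.getvalue()))) == table)
```

To write to a file instead, open it with `open(output_file_path, 'w', newline='', encoding='utf-8')` and pass the file object to write_table.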

Reference

How to spend the terminal-Scraping TV listings http://moxtsuan.hatenablog.com/entry/scrape-tvprogram

Scraping dynamically loaded TV program listings [Python] [Selenium]