I wrote code to scrape TV program information from the Yahoo! TV.G Guide. There are two points where I had to be inventive.
requests
.
So I solved this problem by using selenium.
For the time being, I output the start time, channel, and program title as CSV, as follows:
Start time,Channel,Program title
21:54,5,Hodo Station
22:30,1,Historical Anecdote Historia "Hirosaki Castle 400 Years of the Northern Castle"
22:50,2,Cat commentary, cats and rice scoops. "Shuichi Yoshida and Kin-chan Gin-chan"
23:00,4,news zero New arrival "mild" man waiting at home(52)Death ... waiting for a bed
23:00,6,NEWS23 Ayaka Ogawa ▽ Mone Kamishiraishi Covers the famous song in telework
23:00,7,WBS ▽ Supermarket is very crowded ... 3 How to prevent it?What is shopping agency? ▽ Iris decides to increase mask production
23:00,8,TOKIO Kakeru [Yosuke Eguchi&Kenichi Takito confesses unexpected couple life!Masterpiece drama also released]
...
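Some titles in the sample above contain commas (for example the 22:50 entry), so a reader that splits each line on every comma would see more than three fields. The following is a minimal sketch, not part of the original script, of how Python's standard csv module quotes such fields so that they round-trip. The variable names are only for illustration.

```python
import csv
import io

# Two rows taken from the sample output above; the second title contains a comma.
rows = [
    ["Start time", "Channel", "Program title"],
    ["22:50", "2", 'Cat commentary, cats and rice scoops. "Shuichi Yoshida and Kin-chan Gin-chan"'],
]

# Naive approach: join with commas, then split again. The title is cut in two.
naive_line = ",".join(rows[1])
naive_fields = naive_line.split(",")

# csv module: fields containing commas or quotes are quoted on output
# and restored to their original form when read back.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))

print(len(naive_fields))  # number of fields after the naive split
print(len(parsed[1]))     # number of fields after reading with csv.reader
```

With the csv module the title stays in a single column, at the cost of quotation marks appearing around it in the file.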
Scraping may be considered illegal depending on how it is used. Always check the site's Terms of Service and robots.txt before scraping. Please use the code included in this article at your own discretion and at your own risk.
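As one way to act on the robots.txt advice, the standard library's urllib.robotparser can evaluate rules for a given URL. The sketch below parses a rule set passed in as text, so it runs without network access. The rules, the user agent name, and the example.com URLs are hypothetical and are not the actual robots.txt of any site. The helper name is_allowed is also just for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/listings/"))
print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/private/page"))
```

For a real site you would fetch its robots.txt first and pass the text in. A robots.txt check does not replace reading the Terms of Service.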
macOS Catalina 10.15.4 with Python 3.8, or Raspberry Pi 3 Model B+ with Raspbian Stretch and Python 3.5
Install beautifulsoup4 and selenium with pip.
pip install beautifulsoup4
pip install selenium
If the Chrome browser is not installed on your PC, install it. https://www.google.com/intl/ja_jp/chrome/
On the Mac, install chromedriver with pip, choosing the version that matches your Chrome browser.
pip install chromedriver-binary==<Chrome version number>
Reference: [For selenium] How to install Chrome Driver with pip (no PATH setup required, version can be specified) https://qiita.com/hanzawak/items/2ab4d2a333d6be6ac760
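To make "matches the version of Chrome" a little more concrete, here is a rough sketch that compares two version strings. It assumes that agreement on the first three dot-separated components is enough for the browser and driver to work together. That assumption is mine, not something stated in this article. The version strings are illustrative values and the function names are hypothetical.

```python
def version_prefix(version, parts=3):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(p) for p in version.split(".")[:parts])


def versions_match(chrome_version, driver_version, parts=3):
    """Assumption: browser and driver are compatible if their prefixes agree."""
    return version_prefix(chrome_version, parts) == version_prefix(driver_version, parts)


# Illustrative values only; check your own browser and package versions.
print(versions_match("81.0.4044.113", "81.0.4044.69.0"))
print(versions_match("81.0.4044.113", "80.0.3987.106.0"))
```

If the comparison fails, installing a different chromedriver-binary version with the pip command above is the first thing to try.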
On the Raspberry Pi, install the Chromium driver. (The command below happened to work for me, but honestly I do not fully understand why. It may fail when the browser and driver versions differ, and I have not confirmed what to do in that case.)
sudo apt-get install chromium-chromedriver
Reference: Browser operation on Raspberry Pi with Selenium and chrome driver https://www.miki-ie.com/raspberry-pi/raspberry-pi%E3%81%ABselenium%E3%81%A8chromedriver%E3%81%A7%E3%83%96%E3%83%A9%E3%82%A6%E3%82%B6%E6%93%8D%E4%BD%9C/
I confirmed that this works as of April 21, 2020, but please note that it may stop working if the structure of the page changes in the future. The code below gets the program title, start time, and channel. Rewrite the part that extracts information from the HTML as appropriate for your purpose.
from bs4 import BeautifulSoup
from selenium import webdriver
import csv
import platform

# Branch on the OS so that the same code works on both Mac and Raspberry Pi
OS = platform.system()
if OS == 'Darwin':  # For Mac
    import chromedriver_binary  # Importing this adds chromedriver to PATH
elif OS == 'Linux':  # For Raspberry Pi
    pass  # The chromedriver installed with apt-get is used as-is

output_file_path = 'program.csv'
area = '23'  # Which prefecture's program guide to display. 23 is Tokyo.
date = '20200421'  # Date of the program guide to display (YYYYMMDD).
starttime = '20'  # Hour from which the program guide is displayed.
duration_hour = '6'  # How many hours of program listings to display.

url = 'https://tv.yahoo.co.jp/listings/?'
# Add the area. Can be omitted (commented out). The default value is 23 (Tokyo)
url += ('a=' + area + '&')
# Add the date. Can be omitted (commented out). The default value is the current date
url += ('d=' + date + '&')
# Add the start hour. Can be omitted (commented out). The default value is the current time
url += ('st=' + starttime + '&')
# Add the display duration. Can be omitted (commented out). The default value is 6 (in hours)
url += ('va=' + duration_hour + '&')

# Get the webdriver. If an error occurs around here, suspect a chromedriver-binary version mismatch
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Load the web page, get the HTML, and parse it with BeautifulSoup
driver.get(url)
html = driver.page_source.encode('utf-8')
driver.quit()  # Close the browser now that the HTML has been retrieved
soup = BeautifulSoup(html, 'html.parser')

# Get the program guide information
# Get the channel list written at the top of the program guide
station_elems = soup.find_all('td', class_='station')
stations = [elem.text.split('ch')[0] for elem in station_elems]

# Get the elements that contain the program titles
title_elems = soup.find_all('a', class_='title')

table = [['Start time', 'Channel', 'Program title']]
for elem in title_elems:
    # Get the title
    title = elem.text
    # Get the start time (named so that it does not overwrite the starttime setting above)
    program_start = elem.parent.find('span', class_='time').text
    # Which column of the program guide this entry belongs to
    col = int(elem.get('data-ylk').split('pos:')[1])
    # Get the channel number from the column number
    channel = stations[col - 1]
    # Add the start time, channel, and program title as a row
    table.append([program_start, channel, title])

# Save in CSV format. The csv module quotes titles that contain commas
with open(output_file_path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(table)
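The query string above is assembled by hand, and each parameter can be left out. As an alternative sketch, the standard library's urllib.parse.urlencode can build the same kind of URL while skipping omitted parameters. The parameter keys a, d, st, and va come from the code above. The function name build_listing_url and its argument names are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://tv.yahoo.co.jp/listings/"


def build_listing_url(area=None, date=None, start_hour=None, duration_hours=None):
    """Build the listings URL, leaving out any parameter that is None."""
    params = [
        ("a", area),             # area code, e.g. "23" for Tokyo
        ("d", date),             # date as YYYYMMDD
        ("st", start_hour),      # hour from which listings are shown
        ("va", duration_hours),  # number of hours to show
    ]
    query = urlencode([(key, value) for key, value in params if value is not None])
    return BASE_URL + ("?" + query if query else "")


print(build_listing_url(area="23", date="20200421", start_hour="20", duration_hours="6"))
print(build_listing_url(date="20200421"))
```

Unlike the hand-built version, this does not leave a trailing "&" on the URL, and it escapes any special characters in the values.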
How to spend the terminal-Scraping TV listings http://moxtsuan.hatenablog.com/entry/scrape-tvprogram