[PYTHON] Scraping Keikyu Line timetable

Introduction

In a previous project, I set up my smartphone to tweet automatically when I leave the office. As the next step, I want to display the departure times and delay information for the station closest to my workplace. That way I can take a detour if I have time, or just walk if I don't.

There is "Ekispert Web Service" as an API to get the timetable, but there is a charge.

"Ekispert Web Service" https://docs.ekispert.com/v1/

So I decided to build it myself, and while researching I came across an interesting article.

Blog "Omitting scissors": Thoughts on timetable scraping tools https://nkth.info/blog/dia_scraping/

According to that article, extracting timetable information for personal use should not be a problem as long as it does not put a load on the server. Therefore, we will get the timetable and delay information by scraping. Since I commute on the Keikyu Line, this article focuses on the Keikyu Main Line.

In this article, we will get the departure times and delay information for the Keikyu Line by scraping.

(Image: what this article aims for)

Assumptions and execution environment

The assumptions this time are as follows.
・Railway line used: Keikyu Main Line
・Nearest station to work: Koganecho Station (it has nothing to do with my home address)
・Train used: local train, down direction

Execution environment:
・Ubuntu 18.04 LTS (planning to move to Android or AWS in the future)
・Python 3.6.9

Environment

A well-known library for scraping is Beautiful Soup.

Code Zine: Try parsing HTML with Python and collecting data? "Python second grade" that understands scraping from the beginning https://codezine.jp/article/detail/12230

However, in the case of the Keikyu Line timetable, Beautiful Soup alone cannot retrieve the data properly. The data obtained when trying it with Beautiful Soup is shown below.
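The original trial code is not shown here, but a minimal sketch of such a trial might look like the following. Note the assumptions: the URL uses the simplified parameter form explained later in this article, and slCode 250-40 (Keikyu Taura Station, the station that appears in the output below) is taken from the links in the acquired data.


import requests
from bs4 import BeautifulSoup

#Keikyu Taura Station, down, weekday (simplified URL form; assumed for illustration)
url = "https://norikae.keikyu.co.jp/transit/norikae/T5?dw=0&slCode=250-40&d=2"

res = requests.get(url)
res.encoding = res.apparent_encoding  #the page is not served as UTF-8, so guess the encoding
soup = BeautifulSoup(res.text, "html.parser")
print(soup.prettify())  #only partial HTML comes back, as shown below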

Acquired data


<html>
<head>
<title>Keikyu Main Line: Keikyu Taura Station Timetable</title>
<meta content="max-age=1" http-equiv="Cache-Control"/>
</head>
<body>
★ Search results<br/>
Keikyu Main Line<br/>
Keikyu Taura Station Timetable<br/>

For Uraga<br/>
Weekday timetable<br/>
23:04 Uraga<font color="#000000">Local</font><br/>
23:17 Uraga<font color="#000000">Local</font><br/>
23:27 Uraga<font color="#000000">Local</font><br/>
23:40 Uraga<font color="#000000">Local</font><br/>
23:52 Uraga<font color="#000000">Local</font><br/>
 0:04 Uraga<font color="#000000">Local</font><br/>
 0:23 Uraga<font color="#000000">Local</font><br/>
As of 2020/7/20<br/>
<hr/>
<a href="T5?uid=27329&amp;dir=38&amp;path=2020102623438741&amp;slCode=250-40&amp;time=2125&amp;d=2&amp;dw=0&amp;date=&amp;pFlg=2&amp;reFlg=0&amp;USR=IM">↑ Previous time</a><br/>
<a accesskey="1" href="T3?uid=27329&amp;dir=38&amp;path=2020102623438741&amp;sf=%8B%9E%8B%7D%93%63%89%59&amp;sfCode=2053&amp;slCode=250-40&amp;d=2&amp;time=2300&amp;dw=0&amp;USR=IM">1.Direction/To time selection</a><br/>
<hr width="80%"/>
Timetables of stations nationwide:<a href="http://1069.jp/">Station search ★ Timetable</a><hr width="80%"/>
<a accesskey="8" href="/transit/norikae/T1?USR=IM&amp;sf=%8B%9E%8B%7D%93%63%89%59">8.Station timetable top</a><br/>
<a accesskey="9" href="http://www.keikyu.co.jp/m/index.html">9.To the top</a><br/>
<center>(C)KEIKYU</center>
</body>
</html>

Only part of the timetable comes back. This is because Keikyu's web server sends only the data the browser needs to draw the page, and the rest is generated in the browser. In other words, the information visible in the browser ≠ the data returned from the server.

GAMMASOFT: How to scrape web pages that cannot be retrieved with requests https://gammasoft.jp/blog/how-to-download-web-page-created-javascript/

So, as in the article above, we use "requests-html" to get the page information after it has been rendered by the browser. Install "requests-html" with the following commands.

requests-html installation


$ pip install requests #requests-html dependency library
$ pip install requests-html

Modifying the sample code from that article, we specify the URL for the down local trains at Koganecho Station in the Keikyu Line timetable, render the page in a browser engine, and then retrieve the source.

req.py


from requests_html import HTMLSession

url = "https://norikae.keikyu.co.jp/transit/norikae/T5?uid=34683&dir=7&path=202010272317801&USR=PC&dw=0&slCode=250-28&d=2&rsf=%89%A9%8B%E0%92%AC"
#URL for Koganecho Station, down direction, local trains (Keikyu Line timetable)

#Start a session
session = HTMLSession()
r = session.get(url)

#Render the HTML in the browser engine (Chromium via pyppeteer)
r.html.render()
print(r.text)  #output the page HTML

On the first run only, Chromium is downloaded and installed automatically.

Output on the first run


$ python req.py 
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
100%|████████████████████████████████████████████████████████| 108773488/108773488 [00:09<00:00, 11116758.16it/s]
[W:pyppeteer.chromium_downloader] 
chromium download done.

Confirm that the page information is output successfully. (The full output is too long to show here, so it is omitted.)

Incidentally, you can check whether the trains are running normally with just the following command.

$ python req.py | grep "Operation information" -A 1  #output the line containing "Operation information" and the following line
<div style="font-size: larger;">[Operation information]<a href="https://unkou.keikyu.co.jp/?from=top" target="_blank">
Operates as usual

Creating a station table

We will make a correspondence table between the station name, the day type (weekday, Saturday, holiday), the direction (for Sengakuji or for Uraga), and the URL.

Let's take a look at the URL of the Keikyu Line timetable page.

URL for station name: Koganecho, day type: weekday, direction: for Uraga


https://norikae.keikyu.co.jp/transit/norikae/T5?uid=6403&dir=18&path=20201029203818692&USR=PC&dw=0&slCode=250-28&d=2&rsf=%89%A9%8B%E0%92%AC

There are many URL parameters, but the essential ones turned out to be only the following three.
・dw (day of the week? 0 is weekday, 1 is Saturday, 2 is holiday)
・slCode (unique code of the station, counting up from 250-00 for Shinagawa)
・d (direction? 1 is up, 2 is down)
Therefore, the page reached by the URL above is the same as the following.

URL for station name: Koganecho, day type: weekday, direction: for Uraga (required parameters only)


https://norikae.keikyu.co.jp/transit/norikae/T5?dw=0&slCode=250-28&d=2

That is much more manageable. With parameters this simple, writing a program should be straightforward. Incidentally, the slCode for Haneda Airport (253-7) is an exception to the numbering.
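As a minimal sketch, building the URL from these three parameters could look like the following. The station codes in the dictionary are only the ones that appear in this article and are listed purely for illustration; the actual program reads the station name and slCode from station_table.csv.


#Minimal sketch: build a timetable URL from the three essential parameters.
#The codes below are only those mentioned in this article (illustration only);
#the real program looks them up in station_table.csv.
SL_CODES = {
    "Shinagawa": "250-00",
    "Yokohama": "250-25",
    "Koganecho": "250-28",
    "Keikyu Taura": "250-40",
}

def timetable_url(station, direction, day_type):
    #direction: 1 = up, 2 = down / day_type: 0 = weekday, 1 = Saturday, 2 = holiday
    base = "https://norikae.keikyu.co.jp/transit/norikae/T5"
    return base + "?dw=" + str(day_type) + "&slCode=" + SL_CODES[station] + "&d=" + str(direction)

print(timetable_url("Koganecho", 2, 0))
#-> https://norikae.keikyu.co.jp/transit/norikae/T5?dw=0&slCode=250-28&d=2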

Implementation

Now that we can retrieve the page, let's scrape it. Create the following program; see the comments for details. This program (keikyu_futsu.py) and the station table file (station_table.csv) are also available on GitHub. https://github.com/zakuzakuzaki/zaki-aws/blob/main/station/koganecho.py

keikyu_futsu.py


from requests_html import HTMLSession #Scraping
import datetime #Get the current time
import csv #Read csv file
import sys #Command line arguments, program termination

def get_info(st, r):

    #Split the page text into lines
    #Reference URL: https://karupoimou.hatenablog.com/entry/2019/07/08/112734
    page = r.text.split("\n")
    unko = "operation status unknown"  #fallback in case the string is not found
    for i in range(len(page)):
        p = page[i]
        if "Operation information" in p:
            unko = page[i+1]  #the status is written on the line after the "Operation information" string
    return st + " Station: " + unko

def get_timetable(r):

    #Get the timetable (hours)
    hour = r.html.find(".side01")  #weekday page
    if len(hour)==0:
        hour = r.html.find(".side02")  #Saturday / holiday page
    hour_list = []
    for h in hour:
        hour_list.append(int(h.text))
    train = hour_list[0]  #hour of the first train

    #Get the timetable (minutes)
    minute = r.html.find(".min1001")  #CSS class of local trains
    minute_list = []
    for m in minute:
        minute_list.append(int(m.text))
    del minute_list[0]   #the first and last elements are not departure times, so delete them
    del minute_list[-1]  #same as above

    #Initialize a two-dimensional array to hold the timetable
    num = len(minute_list)
    dep = [[0 for i in range(2)] for j in range(num)]

    #Build the timetable: when the minute value decreases, the hour has advanced
    for i in range(num):
        if  i>0 and minute_list[i-1] > minute_list[i]:
            train+=1
        dep[i] = (train, minute_list[i])
    return dep

def echo_dep(dep, time):

    #List to hold the next three departure times after the given time
    dep_time = []
    #Find the departure nearest to the current time
    next = None
    now_i = 0
    num = len(dep)
    for i in range(num):  #search from the first train onwards
        if dep[i][0]==time.hour:  #compare the hour with the current time
            now_i = i
            if dep[i][1]>time.minute:  #compare the minutes with the current time
                next = i
                break
    if next is None:  #no departure left in the current hour, so move on to the next hour
        next = now_i + 1

    #Build the list for display
    for i in range(3):
        if next+i >= num:
            dep_time.append("~last train~")
            break
        dep_time.append(str(dep[next+i][0]).zfill(2)+":"+str(dep[next+i][1]).zfill(2))

    return dep_time

def get_url(st, dir, dw):

    #Look up the slCode parameter in the station table
    #Reference: https://note.nkmk.me/python-csv-reader-writer/
    with open('station_table.csv') as f:
        reader = csv.reader(f)
        l = [row for row in reader]
    slCode = None
    for row in l:
        if row[0]==st:  #search by station name
            slCode = row[1]
            break
    if slCode is None:
        print("Error: Unknown station name.")
        sys.exit(1)
    #The parameters d and dw are used as they are
    return "https://norikae.keikyu.co.jp/transit/norikae/T5?dw="+dw+"&slCode="+slCode+"&d="+dir

if __name__ == '__main__':

    if len(sys.argv) != 4:
        print("Give the following command line arguments.\n1->Station name (Japanese), 2->Uphill:1, down:2,3->Weekdays:0, Saturday:1, holiday:2")
        sys.exit(1)

    url = get_url(sys.argv[1], sys.argv[2], sys.argv[3])
    #① station name (Japanese), ② up:1, down:2, ③ weekday:0, Saturday:1, holiday:2

    #Session start
    session = HTMLSession()
    r = session.get(url)

    #Generate HTML in the browser engine
    r.html.render()

    #Get operation information
    info = get_info(sys.argv[1],r)

    #Get timetable
    dep = get_timetable(r)

    #Get the current time
    #Reference URL: https://note.nkmk.me/python-datetime-now-today/
    time = datetime.datetime.now()

    #Get the next departure time
    dep_time = echo_dep(dep, time)

    #Result output
    print("Thank you for your support today.")
    print(info)
    print("The next departure is,")
    for t in dep_time:
        print(t)

    sys.exit(0)

The program above was written with reference to the following articles.

・String extraction
Naro analysis record: [Python sample code] A simple way to extract only "lines containing a specific character string" when scraping https://karupoimou.hatenablog.com/entry/2019/07/08/112734

note.nkmk.me: Extract character strings with Python (position / number of characters, regular expressions) https://note.nkmk.me/python-str-extract/

・Getting the current time
note.nkmk.me: Get the current time, date, and datetime with Python https://note.nkmk.me/python-datetime-now-today/

・Reading a csv file
note.nkmk.me: Read / write (input / output) CSV files with Python https://note.nkmk.me/python-csv-reader-writer/

Test

The execution result is as follows.

$ python keikyu.py Koganecho 2 1  #specify "Koganecho / down / weekday"
Thank you for your support today.
Koganecho Station: Operates as usual
The next departure is,
22:12
22:22
22:30

The operation status and the next departure times were displayed successfully. I also confirmed that it works correctly when the station name is changed.

Conclusion

Assuming the moment I leave the office, I was able to display the departure times and delay information for the station closest to my workplace. This was my first attempt at scraping, and it turned out to be quite fun. The Python code is still messy, though, so I would like to clean it up a little. Also, this time I handled only local trains, but other train types can be supported by changing the CSS class that is extracted; see the sketch after the class correspondence below. Specifying multiple train types at once looked like it would take more time, so I skipped it this time.

Correspondence between train type and CSS class (from the Yokohama Station timetable)


<span class="syasyu1004"><span class="min1004">10</span>Limited Express</span>&nbsp;&nbsp;
<span class="syasyu1003"><span class="min1003">10</span>Limited express</span>&nbsp;&nbsp;
<span class="syasyu1010"><span class="min1010">10</span>Airport express</span>&nbsp;&nbsp;
<span class="syasyu1001"><span class="min1001">10</span>usually</span>

URL of the above page (Yokohama Station Uraga direction timetable) https://norikae.keikyu.co.jp/transit/norikae/T5?uid=22537&dir=34&path=2020102723533416&USR=PC&dw=0&slCode=250-25&d=2&rsf=%89%A1%95%6C
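As a rough sketch, supporting another train type would mainly mean changing the CSS class passed to find(). The class names below are taken from the snippet above; whether the rest of get_timetable() (hour cells, trimming of the legend elements) works unchanged for other types is an assumption I have not tested.


#Rough sketch: pick up the minutes for a different train type by changing the CSS class.
#Class names are taken from the Yokohama Station snippet above.
TRAIN_CLASSES = {
    "local": ".min1001",
    "limited express": ".min1003",
    "airport express": ".min1010",
}

def get_minutes(r, train_type):
    #r is the rendered response from requests_html, e.g. get_minutes(r, "limited express")
    cells = r.html.find(TRAIN_CLASSES[train_type])
    #As in get_timetable(), the first and last matches may be legend samples rather than times.
    return [int(c.text) for c in cells]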
