The Kusatsu City Planning and Transportation Policy Division publishes the data for the Kusatsu City community bus "Mame Bus" as open data. http://www.city.kusatsu.shiga.jp/kurashi/kotsudorokasen/mamebus/opendata.html
Here, I will work with this "Mame Bus" data. The goal is to download all of the data automatically and build a proper database so that the bus data can be used on the web. The Excel data should be parseable not only on Windows but also on Linux.
This makes it possible to keep using the latest data on many rental servers without human intervention, as long as any changes to the data stay within limits.
demo http://needtec.sakura.ne.jp/bus_data/kusatu.html
Github https://github.com/mima3/bus_data
Download the data and build the DB with the following command:

```shell
python import.py application.ini
```

When the source data is updated, the same command fetches the latest data, so it is a good idea to run it periodically with cron.
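For example, a crontab entry like the following would rebuild the DB every night at 3:00 (the installation path here is hypothetical):

```shell
# Hypothetical path: rebuild the bus DB every day at 03:00
0 3 * * * cd /home/user/bus_data && python import.py application.ini
```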
The data for each bus line consists of the following three kinds of files.
name | format | description |
---|---|---|
timetable | Excel | Data describing the arrival time at each bus stop. Arrival times may differ between weekdays, Saturdays, and holidays. |
bus stops | CSV | CSV data containing the name, reading, and coordinates of each bus stop. The same route may have multiple files, e.g. clockwise and counterclockwise. |
route map | shapefile | A shapefile describing the shape of the route. Note that the geodetic system is the Japan Plane Rectangular coordinate system 2000, zone VI. The same route may have multiple files, e.g. clockwise and counterclockwise. |
There are a few things to keep in mind when working with data.
There are multiple bus stops with the same name.
For example, look at M04_stops_ccw.csv. It contains two rows for Nomura Sports Park Exit: one at 135.954709,35.023382 and another at 135.954445,35.023323.
A bus may also stop at the same bus stop more than once on a single route, even when the stop's position is identical.
For example, in M04_stops_ccw.csv, Kusatsu Station West Exit appears as both the 1st and the 37th stop.
The stop order is recorded in the second column of the CSV.
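As a sketch, the stops CSV can be read with the standard csv module. The column layout below is an assumption based on the description above (stop order in the second column, plus name, reading, and longitude/latitude), and the sample rows are invented; check the actual files before relying on this:

```python
import csv
import io

# Hypothetical sample in the assumed column order:
# stop id, stop order, stop name, reading, longitude, latitude
SAMPLE = """1,1,Kusatsu Sta. West Exit,kusatsueki-nishiguchi,135.955,35.018
2,2,Nomura Sports Park Exit,nomura-undokoen-guchi,135.954709,35.023382
"""

def read_stops(fp):
    """Parse the stops CSV into a list of dicts keyed by field name."""
    stops = []
    for row in csv.reader(fp):
        stops.append({
            'id': int(row[0]),
            'seq': int(row[1]),   # stop order (second column)
            'stopName': row[2],
            'reading': row[3],
            'lon': float(row[4]),
            'lat': float(row[5]),
        })
    return stops

stops = read_stops(io.StringIO(SAMPLE))
print(stops[1]['stopName'])
```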
The bus stop names may differ between the Excel and CSV files, for example because of line breaks or half-width/full-width character differences. (Some of these differences exist only in the original Japanese, e.g. half-width versus full-width digits, so they are not visible in the translated names below.)
csv | excel |
---|---|
In front of Yamada Elementary School | Yamada Elementary School |
Kinokawa East | Kikawahigashi |
Nishishibukawa 1-chome | Nishishibukawa 1-chome |
Nomura 8-chome | Nomura 8-chome |
In front of Shindo Junior High School | Shindo Junior High School |
Ideally, the layout of the data in Excel would be the same everywhere, but in fact it differs from workbook to workbook.
Take a look at M01_stop_times.xlsx.
In this workbook, the presence or absence of a "●" indicates whether a row applies to Saturdays or weekdays. Now let's look at another workbook, M03_stop_times.xlsx.
Here, each day type has its own sheet.
Workbook | How the day type is determined |
---|---|
M01_stop_times.xlsx | Judgment by ● |
M02_stop_times.xlsx | Judgment by ● |
M03_stop_times.xlsx | Sheet split |
M04_stop_times.xlsx | Sheet split |
M05_stop_times.xlsx | No mention of day of the week |
As you can guess from this, it is safest to assume that the data start position differs from workbook to workbook.
Since the times are entered starting from the row after the bus stop names, you might try to scan from there and treat a row of all blanks as the end of the timetable.
However, even that does not work. Look at the Yamada Line (Kinokawa circulation, counterclockwise) in M04_stop_times.xlsx: the row after the bus stop names is blank, and the data starts on the row after that.
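One way around this is to skip blank rows before the data begins, and only treat an all-blank row as the end of the timetable once time data has actually appeared. A minimal sketch of that idea (not the actual import.py logic), assuming the sheet has already been read into a list of rows:

```python
def extract_timetable(rows, start):
    """Collect timetable rows, skipping leading blank rows and stopping
    at the first all-blank row after the data has begun."""
    result = []
    started = False
    for row in rows[start:]:
        if all(cell == '' for cell in row):
            if started:
                break       # the timetable is over
            continue        # blank row before the data starts
        started = True
        result.append(row)
    return result

grid = [
    ['', ''],            # blank row right under the bus stop names
    ['7:00', '7:05'],
    ['8:00', '8:05'],
    ['', ''],            # end of the timetable
    ['remarks', ''],     # unrelated notes further down the sheet
]
print(extract_timetable(grid, 0))
```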
For a sheet like the following, adding more buses only changes the amount of data, so the Excel-parsing logic does not need to change.
But consider the next sheet.
Here, as the amount of data grows, the start position of the data block below it also shifts, so the parsing logic must change accordingly.
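The importer described below records these start positions explicitly in a configuration file, but another option (an idea, not the actual implementation) is to locate the start position dynamically by searching for a known bus stop name in the sheet:

```python
def find_header(rows, known_stop):
    """Return the (row, col) of a known bus stop name, or None if absent."""
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell == known_stop:
                return (r, c)
    return None

# Hypothetical sheet contents read into a list of rows.
grid = [
    ['Timetable', '', ''],
    ['', 'Kusatsu Sta. West Exit', 'Nomura Sports Park Exit'],
    ['', '7:00', '7:05'],
]
print(find_header(grid, 'Kusatsu Sta. West Exit'))
```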
To work with Excel in Python, use xlrd. https://github.com/python-excel/xlrd
Googling will turn up various samples that use this library, but it is best to start from the official sample code:
https://github.com/python-excel/xlrd/blob/master/scripts/runxlrd.py
For example, implementations like the following are common.
```python
from xlrd import open_workbook

wb = open_workbook('test_err.xlsx')
for sh in wb.sheets():
    for row in range(sh.nrows):
        values = []
        for col in range(sh.ncols):
            v = sh.cell(row, col).value
            if not isinstance(v, basestring):
                v = str(v)
            v = v + ':' + str(sh.cell(row, col).ctype)
            values.append(v)
        print ','.join(values)
```
The above code works fine for .xls files and for .xlsx files without merged cells. However, running it on a sheet that contains merged cells like the following raises an error.
Error output:

```
Traceback (most recent call last):
  File "test2.py", line 7, in <module>
    v = sh.cell(row,col).value
  File "C:\Python27\lib\site-packages\xlrd-0.9.3-py2.7.egg\xlrd\sheet.py", line 399, in cell
    self._cell_types[rowx][colx],
IndexError: array index out of range
```
Apparently the number of columns can differ from row to row, so the column count must be obtained per row. The problem can be avoided by using row_len to get each row's column count, as shown below.
```python
from xlrd import open_workbook

wb = open_workbook('test_err.xlsx')
for sh in wb.sheets():
    for row in range(sh.nrows):
        values = []
        for col in range(sh.row_len(row)):
            v = sh.cell(row, col).value
            if not isinstance(v, basestring):
                v = str(v)
            v = v + ':' + str(sh.cell(row, col).ctype)
            values.append(v)
        print ','.join(values)
```
In addition, runxlrd.py also shows how to display date cells, so it is worth reading through once.
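As background for the timetable cells: Excel stores times as fractions of a day, so a cell holding 07:15 comes back from xlrd as roughly 0.302. A minimal sketch of converting such a value to an HH:MM string (a real implementation should use xlrd's xldate helpers together with the workbook's datemode):

```python
def excel_time_to_hhmm(value):
    """Convert an Excel time cell value (a fraction of a day) to 'HH:MM'."""
    total_minutes = int(round(value * 24 * 60))
    hours, minutes = divmod(total_minutes, 60)
    return '%02d:%02d' % (hours, minutes)

print(excel_time_to_hhmm(0.5))       # -> 12:00
print(excel_time_to_hhmm(0.302083))  # -> 07:15
```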
As mentioned earlier, some of these Excel files appear to have been laid out on a whim, so each one has to be handled flexibly. I therefore described how to import each file in a JSON configuration file and import the data according to it.
https://github.com/mima3/bus_data/blob/master/data/kusatu.json
This time, compressed and uncompressed files are mixed together. The processing to perform after each download is therefore described in the download section of the configuration file.
data/kusatu.json
```json
"download" : {
    "http://www.city.kusatsu.shiga.jp/kurashi/kotsudorokasen/mamebus/opendata.files/M01_stop_times.xlsx" : "save_local",
    "http://www.city.kusatsu.shiga.jp/kurashi/kotsudorokasen/mamebus/opendata.files/M01_stops_ccw.csv" : "save_local",
    "http://www.city.kusatsu.shiga.jp/kurashi/kotsudorokasen/mamebus/opendata.files/M01_shapes.zip" : "expand_zip",
```
save_local simply saves the file to the local disk; expand_zip saves it and then tries to unzip it.
See below for the actual code. https://github.com/mima3/bus_data/blob/master/downloader.py
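As a rough sketch of what the expand_zip action does (this is not the actual downloader.py code), unzipping a saved archive only needs the standard zipfile module:

```python
import os
import tempfile
import zipfile

def expand_zip(path, dest_dir):
    """Extract a downloaded zip archive into dest_dir and return its member names."""
    with zipfile.ZipFile(path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()

# Usage sketch: build a small dummy archive, then expand it.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, 'M01_shapes.zip')
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('M01_shapes/M01.shp', b'dummy')

names = expand_zip(archive, tmp)
print(names)
```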
The importer also handles notation differences between the CSV and Excel files.
The basic rules are as follows:
- Convert bus stop names according to the convert_rule in the configuration file
- Remove line breaks
- Convert half-width characters to full-width
**Bus stop name conversion**
data/kusatu.json
```json
"convert_rule" : {
    "In front of Yamada Elementary School": "Yamada Elementary School",
    "Kinokawa East": "Kikawahigashi",
    "Nishishibukawa 1-chome": "Nishishibukawa 1-chome",
    "Nomura 8-chome": "Nomura 8-chome",
    "In front of Shindo Junior High School": "Shindo Junior High School"
},
```
bus_data_parser.py
```python
def convert_bus_stop_name(rule, bus_stops):
    for bus_stop in bus_stops:
        if bus_stop['stopName'] in rule:
            bus_stop['stopName'] = rule[bus_stop['stopName']]
```
**Remove line breaks and convert half-width characters to full-width**
bus_data_parser.py
```python
def get_bus_timetable(wbname, sheetname, stop_offset_row, stop_offset_col, stopdirection, timetable_offset_row, timetable_offset_col, chk_func):
    xls = xlsReader(wbname, sheetname)
    stop_name_list = []
    if stopdirection == DataDirection.row:
        busdirection = DataDirection.col
    else:
        busdirection = DataDirection.row
    xls.set_offset(stop_offset_row, stop_offset_col)
    while True:
        v = xls.get_cell()
        if not v:
            break
        v = zenhan.h2z(v)
        v = v.replace('\n', '')
        stop_name_list.append(v)
        xls.next_cell(stopdirection)
```
Zenhan is used for half-width and full-width conversion. https://pypi.python.org/pypi/zenhan
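If zenhan is not available, the ASCII part of the conversion can be done with the standard library alone, since the full-width forms sit at a fixed offset of 0xFEE0 from half-width ASCII in Unicode. Note this only covers ASCII (zenhan also handles half-width katakana), so treat it as a rough substitute:

```python
# Half-width ASCII (U+0021..U+007E) maps to full-width (U+FF01..U+FF5E);
# the half-width space maps to the ideographic space U+3000.
H2Z_TABLE = {c: c + 0xFEE0 for c in range(0x21, 0x7F)}
H2Z_TABLE[0x20] = 0x3000

def h2z_ascii(text):
    """Convert half-width ASCII characters in text to their full-width forms."""
    return text.translate(H2Z_TABLE)

print(h2z_ascii('1-chome'))
```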
Specify how to import each file as follows.
```json
"import_rule" : [
    {
        "operation_company" : "Kusatsu City",
        "line_name" : "Shopping street circulation line",
        "shape" : "M01_shapes/M01.shp",
        "srid" : 2448,
        "timetables" : [
            {
                "route" : "Route1L",
                "routeName" : "Shopping street circulation line",
                "bus_stops" : "M01_stops_ccw.csv",
                "weekday_timetable" : {
                    "workbook" : "M01_stop_times.xlsx",
                    "sheetname" : "M01_stop_times",
                    "stop_offset_row" : 6,
                    "stop_offset_col" : 3,
                    "timetable_offset_row" : 7,
                    "timetable_offset_col" : 3
                },
                "saturday_timetable" : {
                    "workbook" : "M01_stop_times.xlsx",
                    "sheetname" : "M01_stop_times",
                    "stop_offset_row" : 6,
                    "stop_offset_col" : 3,
                    "timetable_offset_row" : 7,
                    "timetable_offset_col" : 3,
                    "check_func" : "check_shoutengai_saturday"
                },
                "holyday_timetable" : {
                }
            }
        ]
    },  // (snip)
    {
        "operation_company" : "Kusatsu City",
        "line_name" : "Yamada Line (Kitayamada Circulation)",
        "shape" : "M03_shapes/M03.shp",
        "srid" : 2448,
        "timetables" : [
            {
                "route" : "Route3R",
                "routeName" : "Kitayamada circulation line clockwise",
                "bus_stops" : "M03_stops_cw.csv",
                "weekday_timetable" : {
                    "workbook" : "M03_stop_times.xlsx",
                    "sheetname" : "M03_stop_times (weekdays)",
                    "stop_offset_row" : 6,
                    "stop_offset_col" : 3,
                    "timetable_offset_row" : 7,
                    "timetable_offset_col" : 3
                },
                "saturday_timetable" : {
                    "workbook" : "M03_stop_times.xlsx",
                    "sheetname" : "M03_stop_times (saturday)",
                    "stop_offset_row" : 6,
                    "stop_offset_col" : 3,
                    "timetable_offset_row" : 7,
                    "timetable_offset_col" : 3
                },
                "holyday_timetable" : {
                }
            },
            {
                "route" : "Route3L",
                "routeName" : "Kitayamada circulation line counterclockwise",
                "bus_stops" : "M03_stops_ccw.csv",
                "weekday_timetable" : {
                    "workbook" : "M03_stop_times.xlsx",
                    "sheetname" : "M03_stop_times (weekdays)",
                    "stop_offset_row" : 14,
                    "stop_offset_col" : 3,
                    "timetable_offset_row" : 15,
                    "timetable_offset_col" : 3
                },
                "saturday_timetable" : {
                    "workbook" : "M03_stop_times.xlsx",
                    "sheetname" : "M03_stop_times (saturday)",
                    "stop_offset_row" : 14,
                    "stop_offset_col" : 3,
                    "timetable_offset_row" : 15,
                    "timetable_offset_col" : 3
                },
                "holyday_timetable" : {
                }
            }
        ]
    },  // (snip)
```
The shapefile-related keys are shape and srid: shape holds the shapefile name and srid its geodetic system.
The CSV-related key is bus_stops, which holds the CSV file name.
The Excel files are described per day type: weekday, Saturday, and holiday. workbook holds the workbook name and sheetname the sheet name; stop_offset_row and stop_offset_col give the start position of the bus stop names, and timetable_offset_row and timetable_offset_col give the start position of the times. check_func is an optional item naming a callback that is executed each time a timetable row is read.
Here, the callback checks the value of a specific column as shown below; if the data is invalid, it returns False and the row is ignored. This is used to detect the Saturday rows.
import.py
```python
class BusParserCallBack(object):
    def check_shoutengai_saturday(self, workbook, sheet, busrow, buscol, item):
        if sheet.cell(busrow - 1, 2 - 1).value:
            return True
        else:
            return False
```
To read the shapefiles in Python, it's a good idea to use pyshp. Please refer to the following.
** Try to import the land numerical information ShapeFile into the database with Python ** http://qiita.com/mima_ita/items/e614a281807970427921
The geodetic system of the Mame Bus shapefiles is the Japan Plane Rectangular coordinate system 2000, zone VI, whose SRID is 2448. This must be converted to the world geodetic system (WGS84, SRID 4326). Doing the conversion by hand is quite tedious, but a DB that handles geometry, such as SpatiaLite, makes it easy.
In SpatiaLite, the following SQL performs the conversion:

```sql
select AsText(Transform(GeomFromText('POINT(-4408.916645 -108767.765479)', 2448), 4326))
```

The Python code looks like this:
bus_db.py
```python
for timetable in timetables:
    database_proxy.get_conn().execute(
        """
        INSERT INTO RouteTable
        (metaData_id, operationCompany, lineName, route, routeName, geometry)
        VALUES(?, ?, ?, ?, ?, Transform(GeometryFromText(?, ?), ?))
        """,
        (
            meta_id,
            operation_company,
            line_name,
            timetable['route'],
            timetable['routeName'],
            routedict[timetable['route']], src_srid, SRID
        )
    )
```
As shown above, Python libraries let you make use of the "Mame Bus" data without worrying much about the platform.
However, the data is quite quirky and clearly not structured with machine parsing in mind, so expect some pain there.
To make the data easier to handle mechanically, I think improvements like the following are needed on the data side:
- Keep the data consistent across different files such as the CSV and Excel files
- If Excel must be used, at least unify its format
- Design for growth in the amount of data