[PYTHON] Let's utilize the railway data of national land numerical information

Purpose

The national land numerical information download service provides data managed by the Ministry of Land, Infrastructure, Transport and Tourism. In this article, I use its railway data to plot coordinates on Google Maps.

demo: http://needtec.sakura.ne.jp/railway_location/railway

GIT: https://github.com/mima3/railway_location

About data

Railway data can be downloaded from the following page.

National land numerical information: railway data http://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-N02-v2_2.html

For the XML format of the downloaded file, refer to the following specification. http://nlftp.mlit.go.jp/ksj/gml/product_spec/KS-PS-N02-v2_1.pdf

Simply put, the railway data describes the shapes of the lines and the stations. The important elements are:
・ gml:Curve: curve information
・ ksj:RailroadSection: railway section information
・ ksj:Station: station information

Coordinates are stored in gml:Curve. RailroadSection and Station each refer to their Curve through the xlink:href attribute of their location element.
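The relationship between these elements can be sketched as follows. This is only an illustration: the sample XML is a hypothetical, heavily simplified stand-in for a real N02 file (the ids, the Dataset wrapper and the coordinates are made up), and it uses the standard library xml.etree.ElementTree instead of lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample; real N02 files are far larger.
SAMPLE = """<ksj:Dataset xmlns:ksj="http://nlftp.mlit.go.jp/ksj/schemas/ksj-app"
             xmlns:gml="http://www.opengis.net/gml/3.2"
             xmlns:xlink="http://www.w3.org/1999/xlink">
  <gml:Curve gml:id="cv_demo_1">
    <gml:segments>
      <gml:LineStringSegment>
        <gml:posList>
          35.681 139.767
          35.691 139.770
        </gml:posList>
      </gml:LineStringSegment>
    </gml:segments>
  </gml:Curve>
  <ksj:RailroadSection gml:id="rs_demo_1">
    <ksj:location xlink:href="#cv_demo_1"/>
    <ksj:railwayLineName>Demo Line</ksj:railwayLineName>
  </ksj:RailroadSection>
</ksj:Dataset>
"""

NS = {
    'ksj': 'http://nlftp.mlit.go.jp/ksj/schemas/ksj-app',
    'gml': 'http://www.opengis.net/gml/3.2',
}
GML_ID = '{http://www.opengis.net/gml/3.2}id'
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'


def curve_points(root):
    """Map each gml:Curve id to its list of (lat, lng) tuples."""
    curves = {}
    for curve in root.iter('{http://www.opengis.net/gml/3.2}Curve'):
        points = []
        for pos_list in curve.iterfind('.//gml:posList', NS):
            # Each non-empty line of posList holds "lat lng".
            for line in pos_list.text.split('\n'):
                pt = line.split()
                if len(pt) == 2:
                    points.append((float(pt[0]), float(pt[1])))
        curves[curve.get(GML_ID)] = points
    return curves


def section_points(root):
    """Resolve each RailroadSection's location link to its curve points."""
    curves = curve_points(root)
    result = {}
    tag = '{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}RailroadSection'
    for section in root.iter(tag):
        href = section.find('ksj:location', NS).get(XLINK_HREF)
        # href looks like "#cv_demo_1"; drop the leading "#".
        result[section.get(GML_ID)] = curves[href[1:]]
    return result


root = ET.fromstring(SAMPLE)
sections = section_points(root)
```

Here sections maps the section id to the coordinates of the curve it links to. A Station would be resolved in the same way through its own location element.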

Storage in database

Because large amounts of data are awkward to handle as XML, I first store the data in a relational database.

The XML file is large. Loading the whole file into memory and parsing it at once drives memory usage up sharply, and the processing cannot finish. Therefore, use lxml.etree.iterparse to process the file sequentially.

However, when parsing N02-XX.xml with lxml.etree.iterparse, an error occurs. This is because the XML contains a line like this:
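The idea of sequential processing can be sketched as below. This is a minimal illustration, not the article's importer: it uses the standard library xml.etree.ElementTree.iterparse (which has no tag argument, so the tag is checked by hand), and the tiny in-memory document and the function name count_curves are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

CURVE_TAG = '{http://www.opengis.net/gml/3.2}Curve'


def count_curves(source):
    """Stream through source and count gml:Curve elements."""
    count = 0
    # 'end' events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == CURVE_TAG:
            count += 1
            # Drop the element's children and text once it has been
            # handled, so finished elements do not pile up in memory.
            elem.clear()
    return count


# Hypothetical stand-in for a large file on disk.
sample = io.BytesIO(
    b'<root xmlns:gml="http://www.opengis.net/gml/3.2">'
    b'<gml:Curve gml:id="c1"/><gml:Curve gml:id="c2"/>'
    b'</root>'
)
curve_count = count_curves(sample)
```

The same loop shape applies to lxml.etree.iterparse; there the tag can be passed as an argument instead of being compared inside the loop.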

 xmlns:schemaLocation="http://nlftp.mlit.go.jp/ksj/schemas/ksj-app KsjAppSchema-N02-v2_0.xsd">

lxml treats the URI specified here as invalid and reports an error. To avoid this, specify recover=True when parsing the XML with lxml. http://stackoverflow.com/questions/18692965/how-do-i-skip-validating-the-uri-in-lxml

Workaround:

        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://www.opengis.net/gml/3.2}Curve',
            recover=True
        )

For iterparse, this argument is available from lxml 3.4.1 onward, so specify the version when installing lxml.

easy_install lxml==3.4.1

Based on the above, the process of importing railway data XML into the database is as follows.

railway_db.py


# -*- coding: utf-8 -*-
import sqlite3
import sys
import os
# easy_install lxml==3.4.1
from lxml import etree
from peewee import *

database_proxy = Proxy()
database = None


class BaseModel(Model):
    """
    Base class for models
    """
    class Meta:
        database = database_proxy


class Curve(BaseModel):
    """
    Curve information model
    """
    curve_id = CharField(index=True, unique=False)
    lat = DoubleField()
    lng = DoubleField()


class RailRoadSection(BaseModel):
    """
    Railway section information model
    """
    gml_id = CharField(primary_key=True)
    # A foreign key must reference a primary key or a unique column, so
    # location cannot be a foreign key: one curve_id maps to many Curve rows.
    location = CharField(index=True)
    railway_type = IntegerField()
    service_provider_type = IntegerField()
    railway_line_name = CharField(index=True)
    operation_company = CharField(index=True)


class Station(BaseModel):
    """
    Station information model
    """
    gml_id = CharField(primary_key=True)
    # A foreign key must reference a primary key or a unique column, so
    # location cannot be a foreign key: one curve_id maps to many Curve rows.
    location = CharField(index=True)
    railway_type = IntegerField()
    service_provider_type = IntegerField()
    railway_line_name = CharField(index=True)
    operation_company = CharField(index=True)
    station_name = CharField(index=True)
    railroad_section = ForeignKeyField(
        db_column='railroad_section_id',
        rel_model=RailRoadSection,
        to_field='gml_id',
        index=True
    )


def setup(path):
    """
    Database setup
    @param path database path
    """
    global database
    database = SqliteDatabase(path)
    database_proxy.initialize(database)
    database.create_tables([Curve, RailRoadSection, Station], True)


def import_railway(xml):
    """
    Import line and station information from the national land numerical
    information file N02-XX.xml
    TODO:
      The foreign key lookup for stations is inefficient
    @param xml XML path
    """
    commit_cnt = 2000  # flush the insert buffer every commit_cnt rows
    namespaces = {
        'ksj': 'http://nlftp.mlit.go.jp/ksj/schemas/ksj-app',
        'gml': 'http://www.opengis.net/gml/3.2',
        'xlink': 'http://www.w3.org/1999/xlink',
        'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
    }

    with database.transaction():
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://www.opengis.net/gml/3.2}Curve',
            recover=True
        )
        for event, curve in context:
            curveId = curve.get('{http://www.opengis.net/gml/3.2}id')
            print (curveId)
            posLists = curve.xpath('.//gml:posList', namespaces=namespaces)
            for posList in posLists:
                points = posList.text.split("\n")
                for point in points:
                    pt = point.strip().split(' ')
                    if len(pt) != 2:
                        continue
                    insert_buff.append({
                        'curve_id': curveId,
                        'lat': float(pt[0]),
                        'lng': float(pt[1])
                    })
                    if len(insert_buff) >= commit_cnt:
                        Curve.insert_many(insert_buff).execute()
                        insert_buff = []
            # Release the processed element so memory use stays bounded.
            curve.clear()
        if len(insert_buff):
            Curve.insert_many(insert_buff).execute()
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}RailroadSection',
            recover=True
        )
        for event, railroad in context:
            railroadSectionId = railroad.get(
                '{http://www.opengis.net/gml/3.2}id'
            )
            locationId = railroad.find(
                'ksj:location',
                namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            railwayType = railroad.find(
                'ksj:railwayType', namespaces=namespaces
            ).text
            serviceProviderType = railroad.find(
                'ksj:serviceProviderType',
                namespaces=namespaces
            ).text
            railwayLineName = railroad.find(
                'ksj:railwayLineName',
                namespaces=namespaces
            ).text
            operationCompany = railroad.find(
                'ksj:operationCompany',
                namespaces=namespaces
            ).text
            insert_buff.append({
                'gml_id': railroadSectionId,
                'location': locationId,
                'railway_type': railwayType,
                'service_provider_type': serviceProviderType,
                'railway_line_name': railwayLineName,
                'operation_company': operationCompany
            })
            print (railroadSectionId)
            if len(insert_buff) >= commit_cnt:
                RailRoadSection.insert_many(insert_buff).execute()
                insert_buff = []
            # Release the processed element so memory use stays bounded.
            railroad.clear()
        if len(insert_buff):
            RailRoadSection.insert_many(insert_buff).execute()

        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Station',
            recover=True
        )
        for event, railroad in context:
            stationId = railroad.get('{http://www.opengis.net/gml/3.2}id')
            locationId = railroad.find(
                'ksj:location', namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            railwayType = railroad.find(
                'ksj:railwayType',
                namespaces=namespaces
            ).text
            serviceProviderType = railroad.find(
                'ksj:serviceProviderType',
                namespaces=namespaces
            ).text
            railwayLineName = railroad.find(
                'ksj:railwayLineName',
                namespaces=namespaces
            ).text
            operationCompany = railroad.find(
                'ksj:operationCompany',
                namespaces=namespaces
            ).text
            stationName = railroad.find(
                'ksj:stationName',
                namespaces=namespaces
            ).text
            railroadSection = railroad.find(
                'ksj:railroadSection',
                namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            print (stationId)
            insert_buff.append({
                'gml_id': stationId,
                'location': locationId,
                'railway_type': railwayType,
                'service_provider_type': serviceProviderType,
                'railway_line_name': railwayLineName,
                'operation_company': operationCompany,
                'station_name': stationName,
                'railroad_section': RailRoadSection.get(
                    RailRoadSection.gml_id == railroadSection
                )
            })
            if len(insert_buff) >= commit_cnt:
                Station.insert_many(insert_buff).execute()
                insert_buff = []
            # Release the processed element so memory use stays bounded.
            railroad.clear()
        if len(insert_buff):
            Station.insert_many(insert_buff).execute()

Once the data is in the database, it is easy to use.
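As a rough sketch of such use, the coordinates of one line can be read back by joining sections to their curves. The example below uses the standard library sqlite3 against an in-memory database. The table names curve and railroadsection and the automatic id column are assumptions about how peewee lays out the models above, only the columns needed here are created, and the sample rows and the function name line_points are made up.

```python
import sqlite3


def line_points(conn, company, line_name):
    """Return (section id, lat, lng) rows for one company's line."""
    sql = """
        SELECT s.gml_id, c.lat, c.lng
        FROM railroadsection AS s
        JOIN curve AS c ON c.curve_id = s.location
        WHERE s.operation_company = ? AND s.railway_line_name = ?
        ORDER BY s.gml_id, c.id
    """
    return conn.execute(sql, (company, line_name)).fetchall()


# Assumed schema, reduced to the columns used in this sketch.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE curve (
        id INTEGER PRIMARY KEY, curve_id TEXT, lat REAL, lng REAL
    );
    CREATE TABLE railroadsection (
        gml_id TEXT PRIMARY KEY, location TEXT,
        railway_line_name TEXT, operation_company TEXT
    );
""")
# Made-up sample rows: two companies that both own a "Line 1".
conn.executemany(
    'INSERT INTO curve (curve_id, lat, lng) VALUES (?, ?, ?)',
    [('cv_a', 35.0, 139.0), ('cv_a', 35.1, 139.1), ('cv_b', 35.6, 140.1)]
)
conn.executemany(
    'INSERT INTO railroadsection VALUES (?, ?, ?, ?)',
    [('rs_a', 'cv_a', 'Line 1', 'Company A'),
     ('rs_b', 'cv_b', 'Line 1', 'Company B')]
)

points = line_points(conn, 'Company A', 'Line 1')
```

The rows come back in insertion order within each section, which is the order needed to draw the section as a polyline on a map.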

Precautions for use

Below are the points I noticed while handling the national land numerical information (railway data).

・ You cannot narrow down by line name alone. For example, "Line 1" is held by both "Yokohama City" and "Chiba Monorail". You therefore need to filter by both operating company and line name.

・ Names may differ from the ones in everyday use. JR East is recorded as East Japan Railway Company (東日本旅客鉄道), and Tokyo Metro is recorded under its formal name, 東京地下鉄.

・ The line assignment may differ from the one you are used to. For example, an ordinary route map includes "Tokyo" in the "Chuo Line". In the national land numerical information, however, "Tokyo" is not part of the "Chuo Line"; the "Tokyo"-"Kanda" section is treated as the "Tohoku Line". This seems to be because the section between Tokyo Station and Kanda Station runs on dedicated tracks laid along the Tohoku Main Line.
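To see which line names are ambiguous in this way, one option is to count the operating companies per line name. The sketch below again uses sqlite3 with an in-memory database. The table name railroadsection is an assumption about the peewee layout, the sample rows are made up, and ambiguous_line_names is a hypothetical helper.

```python
import sqlite3


def ambiguous_line_names(conn):
    """List line names that belong to more than one operating company."""
    sql = """
        SELECT railway_line_name, COUNT(DISTINCT operation_company)
        FROM railroadsection
        GROUP BY railway_line_name
        HAVING COUNT(DISTINCT operation_company) > 1
        ORDER BY railway_line_name
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(':memory:')
# Assumed table, reduced to the columns used in this sketch.
conn.execute("""
    CREATE TABLE railroadsection (
        gml_id TEXT PRIMARY KEY,
        railway_line_name TEXT,
        operation_company TEXT
    )
""")
# Made-up sample rows.
conn.executemany(
    'INSERT INTO railroadsection VALUES (?, ?, ?)',
    [('s1', 'Line 1', 'Company A'),
     ('s2', 'Line 1', 'Company B'),
     ('s3', 'Line 2', 'Company A')]
)

shared = ambiguous_line_names(conn)
```

Any name returned here has to be combined with an operating company before it identifies a single line.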
