[PYTHON] Visualize railway line data as a graph with Cytoscape 2

Introduction

This series is written for visualization practitioners. It walks through the process of actual graph visualization based on public data, using open source tools such as Cytoscape, IPython Notebook, and Pandas.

Change log

Data processing in an interactive environment

__Figure 1__: A graph connecting railway lines all over Japan. (High resolution version is here)

Introduction

Last time, I processed the file downloaded from the data source using IPython Notebook and loaded the result into Cytoscape. In that state, arranging the nodes (stations) by latitude and longitude works without any problem, but the route data itself does not form a single connected graph. The figure below shows the previous data visualized with another automatic layout algorithm (a force-directed layout):

(High resolution version is here)

If you zoom in, you can see that the stations on each line are connected, but every line exists independently of the others:

In this state, automatic layout, path search, and similar functions will not work well. I would first like to solve this problem by programming interactively in IPython Notebook as before, and then integrate and visualize data provided by public institutions.

Goals for this installment


A notebook that records the actual work

I will keep adding to it from time to time, but you can see the record of the actual work here:

You can run it on your own machine as long as the libraries used in the notebook are installed. Nothing complicated is going on, so even if you are not a Python programmer, you should be able to follow it by reading the notes in the notebook. The data pre-processing performed in the notebook is as follows.

About environment construction when working with Python

I mostly work on UNIX-like operating systems, where it is convenient to set up the environment with Anaconda:

For this kind of data cleansing I often use libraries such as Pandas, NumPy, and SciPy, and Anaconda takes good care of the dependencies among them. It also works well alongside the pip command. To install a library, run

conda install LIBRARY_NAME

In almost all cases this command is all you need, so there is very little to think about.

Details of data preparation work

Now let's see what is going on in the notebook. If possible, run the notebook yourself as you read; it will be easier to follow.

Connect disjointed route data for each station group

The connections are made using a field called Station Group in the original data. Stations in the same group either allow a direct transfer or are within walking distance of each other, so they can be regarded as practically connecting their lines. In the actual work, we connect the stations in the same group with new edges to form a [clique](http://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%AA%E3%83%BC%E3%82%AF_(%E3%82%B0%E3%83%A9%E3%83%95%E7%90%86%E8%AB%96)):

__Figure 2__: Part of the created cliques

By merging these cliques into the original route data, a railway network connecting routes nationwide is formed, as shown in Figure 1.
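The clique construction described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the notebook's actual code; the two-column layout (station ID, station group ID), the sample IDs, and the function name `build_transfer_edges` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (station_id, station_group_id) pairs; the IDs are made up.
stations = [
    ("1001", "G1"),  # a station on one line
    ("2001", "G1"),  # the same place on another line
    ("3001", "G1"),  # a stop on a third line within walking distance
    ("1002", "G2"),  # a station with no transfer partner
]


def build_transfer_edges(rows):
    """Return one (source, target, type) edge per pair of stations sharing a group."""
    groups = defaultdict(list)
    for station_id, group_id in rows:
        groups[group_id].append(station_id)

    edges = []
    for members in groups.values():
        # Connecting every pair inside a group turns the group into a clique.
        for source, target in combinations(sorted(members), 2):
            edges.append((source, target, "transfer"))
    return edges


transfer_edges = build_transfer_edges(stations)
for edge in transfer_edges:
    print(edge)
```

Appending these edges to the per-line edge list is what joins the otherwise independent lines. A group with a single member produces no edge, so stations without transfers are left untouched.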

Check the state of the connected graph

As a brief aside, let's check how the cliques were actually incorporated, taking [Shinjuku Station](http://ja.wikipedia.org/wiki/%E6%96%B0%E5%AE%BF%E9%A7%85), the largest station in Japan, as an example. Shinjuku Station is a huge hub of the railway network where many railway lines converge. __If you visualize it, you should see a clique made up of the stations belonging to the same group, with each line extending radially from it__. I tried this to confirm. I will omit the part about loading the data into Cytoscape (for now), but the basic operations needed to make the following figure are

If you do this, you can draw a diagram like the one below:

(High resolution version)

The green dotted lines indicate transfer connections (within walking distance), and the solid lines indicate the individual routes. Judging from this figure, the connection appears to have succeeded.

Map other publicly available datasets on the rail network

Now you have a "blank railway map" that you can use in Cytoscape. I call this data a blank map for a reason: __you can freely map other datasets onto this network to create your own visualizations__. This is the advantage of converting geographic data into an abstract graph structure, and the point of using it in Cytoscape. There are countless possible mappings, but some simple examples are:

These are all simple examples, but the more data you can map, the more possibilities for visualization open up. To start with, I wanted some simple data for trying out the basic functions, and the "number of passengers per day" seemed easy to understand, so I began searching for publicly available data.

Obtaining and cleaning public data

To tell the truth, I had never looked seriously at Japanese public data sets. What I found when I actually started searching is that the statistics held by each ministry and agency are published, at least nominally, but there is still a strong tendency to publish them as documents meant for __people to read__. Even a very simple data set such as "number of passengers per station" is not published as a table of station name and passenger count pairs. This is the data I "excavated" this time:

Since it is published as an XML file, I assumed it would be relatively easy to process, but as soon as I started, the parser failed. The cause was a simple typo in a closing tag, which can be repaired with:

cat S12-13.xml | sed -e "s/ksj:ailroad/ksj:railroad/" > fixed.xml
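The same repair can also be done in Python just before parsing. The sketch below is only an illustration: the sample document, the namespace URI `http://example.com/ksj`, and the helper name `repair_closing_tag` are invented, and only the misspelled tag name `ksj:ailroad` comes from the command above.

```python
import xml.etree.ElementTree as ET


def repair_closing_tag(xml_text, broken="ksj:ailroad", fixed="ksj:railroad"):
    """Replace the misspelled tag name, mirroring the sed substitution."""
    return xml_text.replace(broken, fixed)


# Tiny made-up document that reproduces the kind of mismatch described above.
sample = (
    '<ksj:Dataset xmlns:ksj="http://example.com/ksj">'  # hypothetical namespace URI
    "<ksj:railroad>Ginza</ksj:ailroad>"
    "</ksj:Dataset>"
)

# The broken text is rejected by the parser...
try:
    ET.fromstring(sample)
    parsed_before_fix = True
except ET.ParseError:
    parsed_before_fix = False

# ...while the repaired text parses normally.
root = ET.fromstring(repair_closing_tag(sample))
print(parsed_before_fix, root[0].text)
```

Unlike the sed command, `str.replace` substitutes every occurrence rather than only the first one on each line, which is harmless here because the misspelled name should never appear legitimately.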

Problems when actually using

Once I got hands-on and scrutinized the contents, I realized it would not be as simple as I had expected. Some of the issues are:

At this point my motivation dropped a little, but I decided to do only the parts I could, as a demonstration of the method, and went ahead while turning a blind eye to those issues (I don't think that would be acceptable in real work...). After some rough-and-ready work like what you see in the notebook, I got a table like this:

(Screenshot of the resulting table)

I explained in the notebook why I deliberately converted the coded information into redundant string information: saving it as human-readable strings makes it easier to work with in visualization software. __For large-scale analysis and visualization this kind of conversion is actually harmful, so assess the size of your data set, weigh the convenience against the load on the computer, and choose an appropriate method__.
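Converting codes into readable labels can be as simple as a dictionary lookup. The following is a minimal sketch under assumed names: the code table, the field name `operator_type`, and the sample rows are invented for illustration and are not taken from the actual data set.

```python
# Hypothetical code table; the real data set defines its own codes.
CODE_LABELS = {"1": "JR", "2": "Public", "3": "Private"}


def decode_rows(rows, code_labels, field="operator_type"):
    """Add a human-readable label column next to a coded column."""
    decoded = []
    for row in rows:
        new_row = dict(row)  # keep the original code as well
        code = row[field]
        # Unknown codes are kept visible instead of being silently dropped.
        new_row[field + "_label"] = code_labels.get(code, "Unknown (" + code + ")")
        decoded.append(new_row)
    return decoded


rows = [
    {"station": "Station X", "operator_type": "1"},
    {"station": "Station Y", "operator_type": "9"},
]
decoded = decode_rows(rows, CODE_LABELS)
for row in decoded:
    print(row)
```

Keeping both the code and the label doubles the storage for that column, which is the trade-off mentioned above: fine for a data set of this size, but worth reconsidering for much larger ones.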

Obtaining route theme color information by scraping

Since the WWW began with the idea of linking human-readable texts, a huge amount of data is published as tables designed to be read by humans, with no expectation that machines will process them. Such data is surprisingly useful, and the information you want is often buried in it.

When visualizing the network with Cytoscape, you can freely create a customized color map for all the elements on the screen. For this dataset, it's easier to understand and more effective to use something that people are familiar with than to set the color coding yourself. For Tokyo Metro, the following colors are standard:

(From [Wikimedia Commons](http://commons.wikimedia.org/wiki/File:Tokyo_metro_map.png#mediaviewer/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB:Tokyo_metro_map.png))

I looked for this color information in a file format that a machine could read easily, but unfortunately I could not find one (please let me know if you know of one). It was, however, properly published as human-readable text:

So I decided to extract it from this page by force. Data buried in text like this comes with various exceptions, but in many cases there are patterns, so a quick extraction is easy:

<tr style="height:20px;">
  <td>Line 3</td>
  <td>
    <a href="/wiki/%E6%9D%B1%E4%BA%AC%E3%83%A1%E3%83%88%E3%83%AD%E9%8A%80%E5%BA%A7%E7%B7%9A" title="Tokyo Metro Ginza Line">
Ginza line
    </a>
  </td>
  <td>G</td>
  <td style="background:#f39700; width:20px;">&#160;</td>
  <td><b>Orange</b></td>
</tr>

If you pair the value __#f39700__ in this snippet with the key __Tokyo Metro Ginza Line__ and load the result into Cytoscape, the color data can be used as-is through _Passthrough Mapping_. Fortunately, Python has many libraries for this kind of scraping work, so I used them to make a rough extraction. This work could be refined as much as you like, but for demonstration purposes a rough job is good enough, so I kept the effort to a minimum.
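As a rough illustration of this kind of extraction, the sketch below pulls the line title and the background color out of a row shaped like the snippet above. It uses only the standard library's `html.parser`, which is an assumption made to keep the example self-contained and not necessarily what the notebook uses; the class name `LineColorParser` is hypothetical.

```python
import re
from html.parser import HTMLParser


class LineColorParser(HTMLParser):
    """Collect (line title, hex color) pairs from table rows."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (title, color) pairs
        self._title = None   # title attribute of the first link in the row
        self._color = None   # background color found in the row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            # A new row starts: forget whatever the previous row held.
            self._title, self._color = None, None
        elif tag == "a" and self._title is None:
            self._title = attrs.get("title")
        elif tag == "td":
            style = attrs.get("style") or ""
            match = re.search(r"background:\s*(#[0-9a-fA-F]{6})", style)
            if match:
                self._color = match.group(1)

    def handle_endtag(self, tag):
        # Only rows that yielded both a title and a color are kept.
        if tag == "tr" and self._title and self._color:
            self.rows.append((self._title, self._color))


sample_html = """
<tr style="height:20px;">
  <td>Line 3</td>
  <td><a href="/wiki/..." title="Tokyo Metro Ginza Line">Ginza line</a></td>
  <td>G</td>
  <td style="background:#f39700; width:20px;">&#160;</td>
  <td><b>Orange</b></td>
</tr>
"""

parser = LineColorParser()
parser.feed(sample_html)
print(parser.rows)
```

Rows that lack either a link title or a background color are skipped, which is a crude but convenient way to ignore header rows and other exceptions in the page.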

Check scraping results

This time the subject of the visualization is Tokyo Metro, so let's make sure that part of the data was extracted properly. A one-liner is enough:

grep "Tokyo Metro" line_colors.csv | awk -F ',' '{print "<span style=\"color:" $3 "\">" $2 "</span><br />"}' > metro_colors.html

The result looks like this, and comparing it with the original data shows no problems:

(Generated HTML)

(Original table)

All of the work above, including the previous installment, is recorded in the IPython Notebook, and running it takes only a few tens of seconds. If you are interested, please take a look at the generated text files (CSV). All of them can be loaded into Cytoscape easily.
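For readers who prefer to stay in Python, the same check can be sketched as below. The column layout is an assumption inferred from the awk fields (`$2` as the line name, `$3` as the color); the content of the first column, the sample rows, and the function name `render_color_preview` are invented.

```python
import csv
import io
from html import escape

# Assumed layout: column 1 unused here, column 2 = line name, column 3 = color.
sample_csv = (
    "1,Tokyo Metro Ginza Line,#f39700\n"
    "2,Some Other Line,#123456\n"
)


def render_color_preview(csv_text, keyword="Tokyo Metro"):
    """Render each matching line name in its own color as a small HTML snippet."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Unlike grep, which matches anywhere on the line, only the name is checked.
        if len(row) < 3 or keyword not in row[1]:
            continue
        lines.append(
            '<span style="color:{}">{}</span><br />'.format(escape(row[2]), escape(row[1]))
        )
    return "\n".join(lines)


preview_html = render_color_preview(sample_csv)
print(preview_html)
```

Writing `preview_html` to a file and opening it in a browser gives the same kind of visual comparison as the one-liner.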

Integration and mapping on Cytoscape

Now, at last, comes the visualization with Cytoscape. This article is getting rather long, though, so I will leave the details for next time. As a preview, here is a brief look at what becomes possible once everything is loaded into Cytoscape.

The working desktop with everything loaded

(High resolution version)

Each station of Tokyo Metro showing the relative positional relationship using the position on the map

(High resolution version)

An example of applying another layout algorithm.

Label size maps to number of passengers per day

(High resolution version)

Yamanote line as a simple connection diagram

(High resolution version)

Combine automatic and manual layout (layout applied to each route)

(High resolution version)

In closing

__"All public statistics are published in a machine-readable form, people who write code use them to build applications that create new value, and statisticians gain new insights from the results of that integration."__

Such a world may come someday. For now, however, what you have seen here is the reality. Excavating "dirty data", cleaning it, and putting it together in a usable form is tedious, unglamorous work, but at present it cannot be avoided. The tools are ready. If you can write code, get hands-on and list what is wrong, and then tell the provider. In the long run, that is probably the only solution.

As you can see from the notebook, the approach is nothing fancy. Unlike general programming, when preparing data for a particular visualization my policy is to prioritize __being able to read back what you did__ over efficiency or elegance. There is still trial and error to be done regarding workflow reusability, but rather than doing anything more elaborate than necessary, people working in visualization may need to focus on creating easy-to-use data sets for visualization applications (Cytoscape, of course, but also custom visualization applications built with D3.js and the like) and proceed while making some compromises. It is now possible to combine tools such as Git, IPython Notebook, and RStudio to automatically keep a record of your work, including your thought process, without stress. I think it is a good idea to use these to gradually find a workflow that suits you.


My apologies to non-members, since it is a Facebook group, but if you are interested in this kind of visualization, please join this group for hands-on practitioners in the visualization field. We plan to share problems and know-how.

Continued in Part 3
