[PYTHON] Visualize Wikidata knowledge with Neo4j

This is the day 8 article of the NTT DoCoMo Service Innovation Department Advent Calendar 2019. This time, I will visualize data from **Wikidata**, a structured knowledge base, using **Neo4j**, a graph DB.

What is Graph DB?

Simply put, it is a database that can handle graph structures. Compared to the widely used RDB, it is designed to handle **relationships** between data [^1]. A graph here is not a line graph or a bar graph, but a data structure like the one in the figure below, represented by a set of nodes (vertices) and edges (quoted from Wikipedia).

6n-graf.png

From this figure alone it is hard to see what a graph is good for, but graph structures are very useful for expressing all sorts of things in the real world. For example, if you treat stations as nodes and railway lines as edges, you can express a route map as a graph, and if you treat cities as nodes and roads as edges, you can express a transportation problem. You can also turn an SNS into a graph by treating accounts as nodes and relationships between accounts as edges. As a practical application, purchase histories can apparently be expressed as a graph and used for product recommendation [^2].

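| What is expressed | Node | Edge |
|---|---|---|
| Route map | Station | Railway line |
| Logistics | City | Road |
| SNS | Account | Relationship between accounts |
| In-house personnel | Employee | Relationship between employees |
| Wikipedia | Page | Link between pages |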

Recently, interest in graph structures seems to be growing in the language processing field as well. For example, at ACL, the top international conference on natural language processing, the number of submitted papers related to GCN (Graph Convolutional Network) was 3 last year and rose sharply to 11 this year.

Let's use this graph DB to visualize the relationship between certain facts.

Information extraction from Wikidata

Visualization needs data to visualize, but creating it manually from scratch is hard. So this time, we will use the dump of **Wikidata** [^3], a knowledge base that is already structured, to create the data to import into the graph DB. "Structured" here roughly means "easy to handle on a computer".

Wikidata is a collaboratively edited knowledge base and, like Wikipedia, one of the Wikimedia projects. In Wikidata, a piece of "knowledge" such as "John Lennon's nationality is the United Kingdom" is expressed as a triple like (John Lennon, nationality, United Kingdom). This (entity 1, property, entity 2) form is called a **relationship triple**. You can think of an entity here as the title of a Wikipedia page. Each entity has a Wikidata-specific identifier that starts with "Q" (for example, Q5 refers to "human"). Similarly, each property has a Wikidata-specific identifier that starts with "P".

All Wikidata data is dumped in JSON format every Wednesday, so let's use this as the data to be imported into the graph DB. Download either latest-all.json.bz2 or latest-all.json.gz from here. For more information on the JSON structure inside, see here.
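To make the record layout a little more concrete, here is a minimal sketch of how one entity record turns into relationship triples. The record is a heavily trimmed, hand-written imitation of a dump entry (an assumption for illustration, real records carry many more fields), and entity_triples is a hypothetical helper name, not something from the dump tooling.

```python
# Hand-written, heavily trimmed imitation of one dump record (illustrative only).
sample_entity = {
    "id": "Q1203",
    "labels": {"en": {"language": "en", "value": "John Lennon"}},
    "claims": {
        # P27 (country of citizenship) points at another entity, Q145 (United Kingdom)
        "P27": [{"mainsnak": {"property": "P27", "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "id": "Q145"}}}}],
        # P569 (date of birth) holds a time value, not an entity
        "P569": [{"mainsnak": {"property": "P569", "datavalue": {
            "type": "time",
            "value": {"time": "+1940-10-09T00:00:00Z"}}}}],
    },
}


def entity_triples(entity):
    """Return (entity ID, property ID, entity ID) triples found in one record."""
    triples = []
    for prop_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            value = statement["mainsnak"].get("datavalue", {}).get("value")
            # Only values that point at another entity carry an "id" key
            if isinstance(value, dict) and "id" in value:
                triples.append((entity["id"], prop_id, value["id"]))
    return triples


print(entity_triples(sample_entity))  # [('Q1203', 'P27', 'Q145')]
```

The date-of-birth claim is dropped because its value is not another entity, which is why only entity-to-entity properties end up as edges in the graph.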

For now, you can run a Python script like the one below to extract entity and property information, as well as relationship triples, from the dump (note that it can take a lot of time and memory).

Sample script
#!/usr/bin/env python
# coding: utf-8

import bz2
import json


def flush(rows, fh):
    """Write buffered rows as tab-separated lines, then empty the buffer."""
    if rows:
        fh.write('\n'.join('\t'.join(x) for x in rows) + '\n')
        rows.clear()


triples = []
qs = []
with bz2.open('latest-all.json.bz2', 'rb') as rf, \
     open('rdf.tsv', 'w', encoding='utf-8') as rdff, \
     open('q_id.tsv', 'w', encoding='utf-8') as qf:
    next(rf)  # Skip the first line (the opening "[" of the JSON array)
    for i, line in enumerate(rf, 1):
        # Each line is one entity followed by ","; the last entity has no comma
        line = line.strip().rstrip(b',')
        if line in (b'', b']'):
            continue
        try:
            entity = json.loads(line)
        except json.JSONDecodeError:
            print('skipped line', i)
            continue

        ett_id = entity.get('id')
        try:
            # Keep only entities and properties that have a Japanese label
            ett_name = entity['labels']['ja']['value']
        except (KeyError, TypeError):
            ett_name = None

        if ett_id is not None and ett_name is not None:
            qs.append((ett_id, ett_name))
            for props in (entity.get('claims') or {}).values():
                for prop in props:
                    p_id = prop['mainsnak']['property']
                    try:
                        id_ = prop['mainsnak']['datavalue']['value']['id']
                    except (KeyError, TypeError):
                        # The value is missing or is not an entity (string, date, ...)
                        continue
                    triples.append((ett_id, p_id, id_))

        if i % 10000000 == 0:
            print(i)
            # Write out and clear the buffers so rows are not written twice
            flush(triples, rdff)
            flush(qs, qf)
    flush(triples, rdff)
    flush(qs, qf)

q_id.tsv is a tab-delimited file as shown in the table below (this file contains not only Q ID but also P ID).

| Q ID | Entity name |
|---|---|
| Q31 | Belgium |
| Q8 | happiness |
| Q23 | George Washington |
| Q24 | Jack Bauer |
| Q42 | Douglas Adams |

Also, rdf.tsv is tab-delimited data as shown in the table below.

| Entity 1 | Property | Entity 2 |
|---|---|---|
| Q31 | P1344 | Q1088364 |
| Q31 | P1151 | Q3247091 |
| Q31 | P1546 | Q1308013 |
| Q31 | P5125 | Q7112200 |
| Q31 | P38 | Q4916 |

Combine the above two files to create the two files to import. First, add a header to q_id.tsv and save it as the tab-delimited file nodes.tsv, as shown in the table below. The third column, :LABEL, is optional; without it, all you really do is add the header line and rename the file. Note that the queries later in this article match on the Entity label, so keep the column if you want to run them as written.

| id:ID | name | :LABEL |
|---|---|---|
| Q31 | Belgium | Entity |
| Q8 | happiness | Entity |
| Q23 | George Washington | Entity |
| Q24 | Jack Bauer | Entity |
| Q42 | Douglas Adams | Entity |

Likewise, save the properties between entities as the tab-delimited file relationships.tsv, as shown in the table below. Add a header to rdf.tsv and replace each :TYPE value in the second column, an ID of the form "P000", with the property name by looking it up in q_id.tsv. This replacement is also optional; without it, all you really do is add the header line and rename the file.

| :START_ID | :TYPE | :END_ID |
|---|---|---|
| Q23 | spouse | Q191789 |
| Q23 | father | Q768342 |
| Q23 | mother | Q458119 |
| Q23 | sibling | Q850421 |
| Q23 | sibling | Q7412891 |

There are many kinds of properties, such as "nationality", "spouse", "place of birth", and "birthday", and using all of them would be too much. For this visualization, I narrowed them down to properties that can be defined between people, for example "relative", "father", "mother", "master", and "disciple". The entities are also limited to those with Japanese names.
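The conversion described above can be sketched roughly as follows. This is only one possible way to do it, not the script actually used for this article: build_import_files is a hypothetical helper, and PERSON_PROPS is an assumed, illustrative selection of person-to-person properties (father, mother, spouse, relative, sibling), since the exact set of IDs that was kept is not listed here.

```python
import csv

# Assumed selection of person-to-person properties (illustrative only):
# P22 father, P25 mother, P26 spouse, P1038 relative, P3373 sibling
PERSON_PROPS = {"P22", "P25", "P26", "P1038", "P3373"}


def build_import_files(q_path, rdf_path, nodes_path, rels_path, keep=PERSON_PROPS):
    """Create nodes.tsv and relationships.tsv from q_id.tsv and rdf.tsv.

    Returns the number of relationships written.
    """
    # Map every labelled ID (Q and P) to its name
    names = {}
    with open(q_path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 2:
                names[row[0]] = row[1]

    with open(nodes_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["id:ID", "name", ":LABEL"])
        for id_, name in names.items():
            if id_.startswith("Q"):  # properties become edge types, not nodes
                writer.writerow([id_, name, "Entity"])

    n_rels = 0
    with open(rdf_path, encoding="utf-8", newline="") as fin, \
         open(rels_path, "w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        writer.writerow([":START_ID", ":TYPE", ":END_ID"])
        for row in csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) != 3:
                continue
            start, prop, end = row
            # Skip edges whose endpoints would be missing from nodes.tsv
            if prop in keep and start in names and end in names:
                writer.writerow([start, names.get(prop, prop), end])
                n_rels += 1
    return n_rels
```

Calling build_import_files("q_id.tsv", "rdf.tsv", "nodes.tsv", "relationships.tsv") would write both files next to the inputs. Dropping edges whose end node has no Japanese label is a design choice made here so that every relationship refers to a node that exists in nodes.tsv.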

Try using Neo4j

There are various graph DBs, but here I will introduce Neo4j [^4], which is relatively popular. Besides Neo4j, I think Amazon Neptune is also well known (the name is cool) [^5].

Installation

For macOS, it is recommended to install with Homebrew.

$ brew cask install homebrew/cask-versions/adoptopenjdk8  #If Java is not included
$ brew install neo4j

Note that if the Java version does not match, the installation fails with the following error:

neo4j: Java 1.8 is required to install this formula.
Install AdoptOpenJDK 8 with Homebrew Cask:
  brew cask install homebrew/cask-versions/adoptopenjdk8
Error: An unsatisfied requirement failed this build.

If you run $ which neo4j and /usr/local/bin/neo4j is displayed, the installation is complete.

Data import

Import the nodes.tsv and relationships.tsv created earlier into Neo4j. To import the data, run the following command.

$ neo4j-admin import --nodes ./Downloads/nodes.tsv --relationships ./Downloads/relationships.tsv --delimiter="\t"

If the import succeeds, output like the following is displayed:

IMPORT DONE in 9s 735ms.
Imported:
  2269484 nodes
  201763 relationships
  6808452 properties
Peak memory usage: 1.05 GB

Use the following commands to start and stop the Neo4j server.

$ neo4j start  #When starting the server
$ neo4j stop  #When stopping the server

After starting the server, open http://localhost:7474 in a browser. The first time you access it, you will be asked to log in, so enter "neo4j" as both the user name and the password. You will then be asked to change the password, so set any password you like.

Process data with Cypher

Now everything is ready, so let's use Neo4j right away. Neo4j uses an SQL-like query language called **Cypher** (referred to as CQL below). Please note that this article does not explain CQL in detail.

December 8th, when this article is posted, is the anniversary of John Lennon's death, so let's take John Lennon as the subject. As an aside, I like "Jealous Guy" the most.

n-hop search

For example, if you want to see the entities associated with "John Lennon", issue the following CQL. It returns every path of one to three hops that starts at the "John Lennon" node, that is, all nodes within 3 hops of it.

match p=((:Entity{name:"John Lennon"})-[*1..3]-()) return p

graph-6.png

It is a little hard to see, but the "John ..." node slightly to the lower left of the center is the "John Lennon" node. For example, following "John Lennon" to "Yoko Ono" (1) -> "Zensaburo Yasuda" (2) -> "Kataoka Nizaemon" (3), you can see that the relationships extend out to 3 hops. The cluster at the bottom right is the familiar Paul McCartney and his family.

Unfortunately, Ringo Starr and George Harrison, who were also members of The Beatles, do not appear in this graph.
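For readers who want to see what an n-hop search does without a running Neo4j server, here is a rough pure-Python sketch using breadth-first search with a depth limit. It only mirrors the idea of the query above (it returns reachable nodes with their hop counts, not paths), within_hops is a hypothetical helper, and the edge list is a tiny hand-written sample based on the chain mentioned above plus one made-up node beyond the limit.

```python
from collections import deque


def within_hops(edges, start, max_hops):
    """Return {node: hop count} for nodes reachable from start in 1..max_hops hops."""
    # Build an undirected adjacency map, since the query ignores edge direction
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]
    return dist


# Hand-written sample; the last node is hypothetical and sits 4 hops away
edges = [
    ("John Lennon", "Yoko Ono"),
    ("Yoko Ono", "Zensaburo Yasuda"),
    ("Zensaburo Yasuda", "Kataoka Nizaemon"),
    ("Kataoka Nizaemon", "(hypothetical 4th-hop node)"),
]
print(within_hops(edges, "John Lennon", 3))
```

With a limit of 3, the hypothetical fourth-hop node is left out, which corresponds to the upper bound in [*1..3].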

Shortest path search

Shortest path search is one of Neo4j's main features.

As a trial, let's search for the shortest path between "Natsume Soseki" and "Ogai Mori", the literary masters of the Meiji era. This has nothing to do with December 8th. As an aside, I like "Gubijinso" (The Poppy) the most.

Issue the following CQL. It simply wraps the pattern that matches every path between "Natsume Soseki" and "Ogai Mori" in shortestpath().

match p=shortestpath((:Entity{name:"Natsume Soseki"})-[*]-(:Entity{name:"Ogai Mori"})) return p

graph-5.png

"John ..." at the top is not John Lennon but "John Manjiro".

The two lived in the same era, but going by Wikidata's knowledge, their relationship is surprisingly distant. Historically too, although the two were acquainted, there seems to have been little that could be called real interaction between them.
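As a rough illustration of what a shortest path search over unweighted relationships involves, here is a textbook breadth-first search in pure Python. It is a sketch of the general technique, not a description of how Neo4j implements shortestpath internally; shortest_path is a hypothetical helper and the node names are abstract placeholders, not Wikidata entities.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    # Undirected adjacency list, matching the direction-less -[*]- pattern
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Placeholder graph: A-B-C-D is a 3-hop route, A-E-D a 2-hop route
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]
print(shortest_path(toy_edges, "A", "D"))  # ['A', 'E', 'D']
```

Because breadth-first search visits nodes in order of hop count, the first time it reaches the goal it has found a path with the fewest hops.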

Then, what is the shortest route between "John Lennon" and "Natsume Soseki", which seem to be unrelated?

match p=shortestpath((:Entity{name:"Natsume Soseki"})-[*]-(:Entity{name:"John Lennon"})) return p

graph-4.png

Starting from "Natsume Soseki", the path passes through "Ryunosuke Akutagawa" and "Samuel Beckett" (author of the play "Waiting for Godot"), crosses the bridge between literature and music at "Bob Dylan" (famous for "Like a Rolling Stone" and winner of the Nobel Prize in Literature), and then reaches "John Lennon" via "Jimi Hendrix", "Prince", and "Shakira". To be honest, it feels a little abrupt and roundabout [^6].

Wikidata is reasonably large structured data, but it is probably not enough to use as a knowledge base as it is. You will need to expand the knowledge base with, for example, **relationship extraction**. Relationship extraction is a technique for extracting a relationship triple such as (Natsume Soseki, author, "Botchan") from a sentence such as "Natsume Soseki, the author of "Botchan"".

Summary

I tried visualizing Wikidata's knowledge using Neo4j. The graph drawn by Neo4j can be dragged around in the browser and wobbles as it moves, which is fun, so please try it out yourself.

[^2]: The graph theory article on Japanese Wikipedia (https://ja.wikipedia.org/wiki/) covers this in detail.

[^5]: Mr. Hayashi wrote an article about it for our Advent Calendar last year → https://qiita.com/dcm_hayashi/items/9b2536b6fbffa0118fad

[^6]: "Nakamura Yoshikoto", a close friend of "Natsume Soseki" since their days at the First High School, is known as one of the confidants of "Goto Shinpei", who served as president of the South Manchuria Railway and as mayor of Tokyo. Later, it was "Zenjiro Yasuda" who supported the city planning proposed by Goto (incidentally, the Hibiya Public Hall was built at this time). If you then follow Yasuda's great-grandchild "Yoko Ono" and her spouse "John Lennon", the path becomes much shorter than the Wikidata result.
