[Python] I tried knowledge graph completion using OpenKE

I performed knowledge graph completion using an open-source framework called OpenKE. As a memo for myself, I will write down the results here.

Article flow

  1. Subject of this article
  2. What is a knowledge graph?
  3. What is OpenKE
  4. Program to use
  5. Execution result
  6. Comparison / evaluation with GitHub
  7. Summary

Subject of this article

This article is intended for people who fall into any of the following categories.

  - People who are interested in knowledge graphs
  - People who want to know what Python can do
  - People who want to use OpenKE

What is a knowledge graph?

A knowledge graph represents connections between pieces of knowledge as a structure.

** Example) ** (obama, born-in, Hawaii)

Data in the form of triples like the one above, each consisting of a subject, a relation, and an object, is called a knowledge graph.

Writing the subject, relation, and object as $s$, $r$, and $o$ respectively, the goal of this task is to infer the missing $o$ from a known $(s, r)$ pair, or the missing $s$ from a known $(r, o)$ pair.
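As a concrete illustration (not from the original article), an embedding model such as DistMult answers these queries by scoring candidate triples: each entity and relation gets a vector, and a triple is scored by the trilinear product of the three vectors. A minimal NumPy sketch with hypothetical toy embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical toy embeddings for entities and relations.
entities = {
    "obama": rng.normal(size=dim),
    "hawaii": rng.normal(size=dim),
    "texas": rng.normal(size=dim),
}
relations = {"born-in": rng.normal(size=dim)}

def distmult_score(s, r, o):
    """DistMult scores a triple (s, r, o) as sum(e_s * w_r * e_o)."""
    return float(np.sum(entities[s] * relations[r] * entities[o]))

# Rank candidate objects for (obama, born-in, ?): higher score = more plausible.
candidates = ["hawaii", "texas"]
scores = {o: distmult_score("obama", "born-in", o) for o in candidates}
best = max(scores, key=scores.get)
print(best)
```

With trained embeddings (rather than these random ones), the highest-scoring candidate is the model's prediction for the missing entity.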

What is OpenKE

OpenKE is an open-source framework created by the Natural Language Processing and Computational Social Science Lab at Tsinghua University (THUNLP).

It is a framework dedicated to knowledge graph embedding, written in C++ and Python, and it currently supports PyTorch and TensorFlow.

For details, please refer to the OpenKE homepage or the GitHub repository. OpenKE Home Page OpenKE's github

Program to use

Next, the program to actually execute is shown below. This time, we will use train_distmult_WN18.py from the examples directory.

```python
import openke
from openke.config import Trainer, Tester
from openke.module.model import DistMult
from openke.module.loss import SoftplusLoss
from openke.module.strategy import NegativeSampling
from openke.data import TrainDataLoader, TestDataLoader

# dataloader for training
train_dataloader = TrainDataLoader(
	in_path = "./benchmarks/WN18RR/",
	nbatches = 100,
	threads = 8,
	sampling_mode = "normal",
	bern_flag = 1,
	filter_flag = 1,
	neg_ent = 25,
	neg_rel = 0
)

# dataloader for test
test_dataloader = TestDataLoader("./benchmarks/WN18RR/", "link")

# define the model
distmult = DistMult(
	ent_tot = train_dataloader.get_ent_tot(),
	rel_tot = train_dataloader.get_rel_tot(),
	dim = 200
)

# define the loss function
model = NegativeSampling(
	model = distmult,
	loss = SoftplusLoss(),
	batch_size = train_dataloader.get_batch_size(),
	regul_rate = 1.0
)


# train the model
trainer = Trainer(model = model, data_loader = train_dataloader, train_times = 2000, alpha = 0.5, use_gpu = True, opt_method = "adagrad")
trainer.run()
distmult.save_checkpoint('./checkpoint/distmult.ckpt')

# test the model
distmult.load_checkpoint('./checkpoint/distmult.ckpt')
tester = Tester(model = distmult, data_loader = test_dataloader, use_gpu = True)
tester.run_link_prediction(type_constrain = False)
```

The settings are as follows: both loaders use the "./benchmarks/WN18RR/" dataset, the model is DistMult, and the loss function is SoftplusLoss(). dim is set to 200. Everything is left the same as in the downloaded example.

There are several other executable example programs in the examples directory.

Regarding settings

There are three parts that can be changed: dataset, model, and loss.

Make sure that "./benchmarks/WN18RR/" in train_dataloader and test_dataloader point to the same dataset. The datasets for these benchmarks are available at the link below. benchmarks
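For reference, OpenKE's benchmark folders store triples as ID files (e.g. train2id.txt): the first line gives the number of triples, and each following line is `head_id tail_id relation_id`. A minimal parser sketch under that assumption, using an in-memory sample instead of a real benchmark file:

```python
import io

def load_triples(f):
    """Parse an OpenKE-style id file: the first line is the triple count,
    each later line is 'head tail relation' as whitespace-separated ids."""
    n = int(f.readline())
    triples = [tuple(map(int, f.readline().split())) for _ in range(n)]
    return triples

# Tiny in-memory stand-in for ./benchmarks/WN18RR/train2id.txt
sample = io.StringIO("2\n0 1 0\n2 3 1\n")
print(load_triples(sample))  # [(0, 1, 0), (2, 3, 1)]
```

A custom dataset can be plugged into TrainDataLoader by writing entity2id.txt, relation2id.txt, and the *2id.txt triple files in this same layout.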

The variables in TrainDataLoader can be changed freely. Besides "normal", "cross" can be selected for sampling_mode. (Using "cross" may require some changes to deeper settings.)

For the available models, please refer to the link below. Available models

In addition to SoftplusLoss, MarginLoss and SigmoidLoss can be used for the loss.
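To see how the loss choices differ, here is an illustrative sketch (plain Python, not OpenKE's actual implementation) of softplus- and margin-style losses for a scoring model, where positive triples should score higher than negatives:

```python
import math

def softplus_loss(pos_score, neg_score):
    """Softplus loss log(1 + exp(-x)): smoothly pushes positive scores up
    and negative scores down; the gradient never vanishes entirely."""
    return math.log1p(math.exp(-pos_score)) + math.log1p(math.exp(neg_score))

def margin_loss(pos_score, neg_score, margin=1.0):
    """Margin loss: zero once the positive outscores the negative by `margin`,
    so well-separated pairs stop contributing to training."""
    return max(0.0, margin - (pos_score - neg_score))

# A well-separated pair yields a near-zero loss either way.
print(softplus_loss(5.0, -5.0))
print(margin_loss(5.0, -5.0))  # 0.0
```

Swapping the loss in the training script is just a matter of passing a different loss object (e.g. MarginLoss) to NegativeSampling.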

Execution result

The execution result is as follows. I don't have a GPU machine, so I ran it on Google Colaboratory. (Screenshot of the execution output, 2020-05-27)

Comparison / evaluation with GitHub

Let's compare it with the Experiments table on GitHub. The table appears to show values for Hits@10 (filtered). (Screenshot of the Experiments table, 2020-05-28)

The average of my experimental results was 0.463306, so the accuracy was about 0.015 lower than the DistMult value reported on GitHub.
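For context, Hits@10 counts the fraction of test queries whose correct entity is ranked within the top 10 (in the filtered setting, after removing other known true triples from the candidate list). Given a list of ranks, it can be computed as in this sketch (illustrative, not OpenKE's code):

```python
def hits_at_k(ranks, k=10):
    """Fraction of test queries whose correct answer ranks within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks for five test triples.
ranks = [1, 3, 12, 7, 50]
print(hits_at_k(ranks))  # 0.6 (3 of the 5 ranks are <= 10)
```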

One possible improvement is to adopt a different loss function. Changing the values of neg_ent, neg_rel, and alpha may also help.

Summary

This time, I tried knowledge graph completion using OpenKE. I did not get the results I expected, but since there is room for improvement, I would like to try the improvements described above.

Thank you for reading until the end.
