[PYTHON] Extract redirects from Wikipedia dumps

I am leaving this here as a personal memo. I have kept it as concise as possible so that you can get the file quickly.


Collect Wikipedia redirects and create a file like the one below.

 {"src": "COVID-19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "COVID-2019", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "Covid-19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "Covid-2019", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "New Coronavirus Infection", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "Covid 19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "COVID19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "2019 New Coronavirus Infection", "dst": "New Coronavirus Infection_(2019)"}
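
Because each line of the file is a standalone JSON object, it can be read back line by line. The following is only a minimal sketch and not part of the extraction script: the helper name `load_redirects` and the inline sample data are assumptions made for illustration.

```python
import io
import json


def load_redirects(lines):
    """Build a {source title: destination title} dict from JSON Lines text.

    `lines` is any iterable of strings, for example an open file object.
    (Hypothetical helper, not part of the extraction script.)
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        mapping[record["src"]] = record["dst"]
    return mapping


# Inline sample standing in for the real output file
sample = io.StringIO(
    '{"src": "COVID-19", "dst": "New Coronavirus Infection_(2019)"}\n'
    '{"src": "Covid 19", "dst": "New Coronavirus Infection_(2019)"}\n'
)
redirects = load_redirects(sample)
print(redirects["COVID-19"])
```

With the real file, passing an open file object should work the same way, e.g. `load_redirects(open("./redirects.json", encoding="utf-8"))`.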

What is a redirect?

See [Wikipedia:Redirect](https://ja.wikipedia.org/wiki/Wikipedia:リダイレクト).

Redirect example

When you access https://ja.wikipedia.org/wiki/COVID-19, you are automatically redirected to https://ja.wikipedia.org/wiki/New Coronavirus Infection_(2019).

Implementation etc.

0. Various things you need

1. Restore Wikidump

Download required data

Please download the necessary data from the following.


- jawiki-[dump acquisition date]-redirect.sql.gz
- jawiki-[dump acquisition date]-page.sql.gz


 $ gunzip jawiki-[dump acquisition date]-redirect.sql.gz
 $ gunzip jawiki-[dump acquisition date]-page.sql.gz
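
If `gunzip` is not available, the same decompression can probably be done with Python's standard library. This is a minimal sketch, not part of the original procedure; the helper name `gunzip_file` is an assumption made for illustration.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path.

    The data is copied in chunks, so a large dump is never held
    fully in memory. (Hypothetical helper.)
    """
    with gzip.open(src_path, "rb") as f_in, open(dst_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```

For example, `gunzip_file("jawiki-[dump acquisition date]-page.sql.gz", "jawiki-[dump acquisition date]-page.sql")`. Unlike `gunzip`, this leaves the original `.gz` file in place.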

Restore to MySQL database

 $ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-page.sql
 $ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-redirect.sql

2. Redirect extraction

Python code

Code that hits the database to extract redirects and saves them in JSON.

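import json
import MySQLdb
import MySQLdb.cursors  # imported explicitly so that DictCursor is available

USERNAME = "[MySQL user name]"
PASSWORD = "[password]"
DB_NAME = "[DB name]"
OUTPUT = "./redirects.json"

def save_jsonl(file_path, data):
    # Write one JSON object per line; ensure_ascii=False keeps Japanese titles readable
    json_dumps = lambda d: json.dumps(d, ensure_ascii=False)
    dumps = map(json_dumps, data)
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n".join(dumps) + "\n")

if __name__ == '__main__':
    # Connect to the database (the default host, localhost, is assumed)
    conn = MySQLdb.connect(user=USERNAME, passwd=PASSWORD, db=DB_NAME)

    # Create a cursor and execute the query
    cur = conn.cursor(MySQLdb.cursors.DictCursor)
    sql = "select page.page_title, redirect.rd_title from page, redirect where redirect.rd_from=page.page_id"
    cur.execute(sql)
    rows = cur.fetchall()

    # Organize the results: titles may come back as bytes, so decode them
    redirects = []
    for row in rows:
        row = {key: cell.decode() if type(cell) is bytes else cell for key, cell in row.items()}
        redirects.append({"src": row["page_title"], "dst": row["rd_title"]})

    save_jsonl(OUTPUT, redirects)

    cur.close()
    conn.close()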



 $ python extract_redirects.py

That's all!

α. Light commentary, etc.

Each record in jawiki-[dump acquisition date]-redirect.sql.gz links the page_id of a redirect source (rd_from) to the title of its redirect destination (rd_title). Each record in jawiki-[dump acquisition date]-page.sql.gz links a page_id to its title (page_title).

Joining the two tables on page_id therefore links each redirect source title to its redirect destination title.
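
The join can be tried out without a MySQL server. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database in place of MySQL, creates only the columns that the query touches, and fills them with made-up rows.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; only the columns used by the
# query are created, and the rows are made-up toy data.
conn = sqlite3.connect(":memory:")
conn.execute("create table page (page_id integer, page_title text)")
conn.execute("create table redirect (rd_from integer, rd_title text)")
conn.executemany(
    "insert into page values (?, ?)",
    [
        (1, "COVID-19"),
        (2, "COVID19"),
        (3, "New Coronavirus Infection_(2019)"),  # the article itself, not a redirect
    ],
)
conn.executemany(
    "insert into redirect values (?, ?)",
    [
        (1, "New Coronavirus Infection_(2019)"),
        (2, "New Coronavirus Infection_(2019)"),
    ],
)

# Same join as in the extraction script: rd_from is the page_id of the source page
rows = conn.execute(
    "select page.page_title, redirect.rd_title "
    "from page, redirect where redirect.rd_from = page.page_id "
    "order by page.page_id"
).fetchall()
pairs = [{"src": src, "dst": dst} for src, dst in rows]
print(pairs)
conn.close()
```

Page 3 has no row in `redirect`, so it does not appear in the result; only pages that are redirect sources are paired with a destination title.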

