I'm leaving this here as a personal memo. I'll keep it as concise as possible so that the file can be produced quickly.

Output

Collect Wikipedia redirects and create a file like the one below, with one JSON object per line.

 {"src": "COVID-19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "COVID-2019", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "Covid-19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "Covid-2019", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "New Coronavirus Infection", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "Covid 19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "COVID19", "dst": "New Coronavirus Infection_(2019)"}
 {"src": "2019 New Coronavirus Infection", "dst": "New Coronavirus Infection_(2019)"}
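As a usage sketch, a file in this shape can be loaded into a dictionary for quick lookups. The function name `load_redirects` is a hypothetical helper, not part of the original script, and the path assumes the output file name used further down in this memo.

```python
import json


def load_redirects(file_path):
    """Read a JSON Lines file of {"src": ..., "dst": ...} records into a dict."""
    mapping = {}
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            mapping[record["src"]] = record["dst"]
    return mapping


if __name__ == "__main__":
    # "./redirects.json" is the output path configured in the extraction script.
    redirects = load_redirects("./redirects.json")
    print(redirects.get("COVID-19"))
```

A plain dict is enough here because each redirect source title maps to exactly one destination title.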

What is a redirect?

See [Wikipedia: Redirect](https://ja.wikipedia.org/wiki/Wikipedia:リダイレクト).

Redirect example

When you try to access https://ja.wikipedia.org/wiki/COVID-19, you are automatically redirected to the page titled "New Coronavirus Infection_(2019)".

Implementation etc.

0. Prerequisites

MySQL
Python3
mysqlclient
pip install mysqlclient

1. Restore the Wikipedia dump

Download required data

Download the two files listed below from the following page.

https://dumps.wikimedia.org/jawiki/

- jawiki-[dump acquisition date]-redirect.sql.gz
- jawiki-[dump acquisition date]-page.sql.gz
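The two file names can also be built programmatically. The sketch below assumes the dumps site uses the layout `<base>/<date>/jawiki-<date>-<table>.sql.gz`; that layout, the helper name `dump_urls`, and the placeholder date are assumptions, so check them against the directory listing before relying on them.

```python
BASE_URL = "https://dumps.wikimedia.org/jawiki"


def dump_urls(dump_date):
    """Return the URLs of the two required dump files for a given dump date.

    Assumes the layout <base>/<date>/jawiki-<date>-<table>.sql.gz.
    """
    return [
        f"{BASE_URL}/{dump_date}/jawiki-{dump_date}-{table}.sql.gz"
        for table in ("redirect", "page")
    ]


if __name__ == "__main__":
    # "20200101" is a placeholder; use a date that is listed on the dumps page.
    for url in dump_urls("20200101"):
        print(url)
```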

Decompress

 $ gunzip jawiki-[dump acquisition date]-redirect.sql.gz
 $ gunzip jawiki-[dump acquisition date]-page.sql.gz
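If `gunzip` is not available, the same step can be done with the Python standard library. This is a minimal sketch; `gunzip_file` is a hypothetical helper name, and unlike `gunzip` it leaves the original `.gz` file in place.

```python
import gzip
import shutil


def gunzip_file(gz_path):
    """Decompress gz_path (ending in .gz) next to it and return the output path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a .gz file: " + gz_path)
    out_path = gz_path[:-3]  # strip the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in chunks so a large dump never has to fit in memory at once.
        shutil.copyfileobj(src, dst)
    return out_path
```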

Restore to MySQL database

 $ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-page.sql
 $ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-redirect.sql

2. Redirect extraction

Python code

The code below (saved as extract_redirects.py) queries the database to extract the redirects and saves them as JSON, one object per line.

import json
import MySQLdb
import MySQLdb.cursors  # makes DictCursor available explicitly

USERNAME = "[MySQL user name]"
PASSWORD = "[password]"
DB_NAME = "[DB name]"
OUTPUT = "./redirects.json"

def save_jsonl(file_path, data):
    # Write one JSON object per line; ensure_ascii=False keeps non-ASCII titles readable
    lines = (json.dumps(d, ensure_ascii=False) for d in data)
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

if __name__ == '__main__':
    # Connect to the database
    conn = MySQLdb.connect(
        user=USERNAME,
        passwd=PASSWORD,
        host='localhost',
        db=DB_NAME
    )

    # Create a cursor and execute the query
    cur = conn.cursor(MySQLdb.cursors.DictCursor)
    sql = "select page.page_title, redirect.rd_title from page, redirect where redirect.rd_from=page.page_id"
    cur.execute(sql)
    rows = cur.fetchall()

    # Decode byte strings and collect the results
    redirects = []
    for row in rows:
        row = {key: cell.decode() if isinstance(cell, bytes) else cell for key, cell in row.items()}
        redirects.append({
            "src": row["page_title"],
            "dst": row["rd_title"]
        })

    # Save
    save_jsonl(OUTPUT, redirects)

    cur.close()
    conn.close()
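
For a large dump, holding every row in memory may be undesirable. The following sketch streams rows from any cursor-like iterable of dict rows straight into the output file. `iter_redirects` and `write_jsonl` are hypothetical helper names; with mysqlclient, a server-side cursor such as `MySQLdb.cursors.SSDictCursor` would be the natural thing to pass in, though that pairing is an untested assumption here.

```python
import json


def iter_redirects(cursor):
    """Yield {"src", "dst"} dicts from an iterable of dict rows."""
    for row in cursor:
        # Like the script above, decode any bytes values to str before use.
        clean = {k: v.decode("utf-8") if isinstance(v, bytes) else v
                 for k, v in row.items()}
        yield {"src": clean["page_title"], "dst": clean["rd_title"]}


def write_jsonl(file_path, records):
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(file_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Intended use, assuming an executed cursor `cur` that yields dict rows:
#     write_jsonl("./redirects.json", iter_redirects(cur))
```

Because both helpers work on plain iterables, they can be tried out with a list of dicts before pointing them at the real database.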

Run

python extract_redirects.py

That's all!

α. Brief commentary

Background: every Wikipedia page is assigned a page_id in addition to its title.

In jawiki-[dump acquisition date]-redirect.sql.gz, each record links the page_id of a redirect source to the title of its redirect destination. In jawiki-[dump acquisition date]-page.sql.gz, each record links a page_id to its title.

By combining these two dumps, the redirect source title and the redirect destination title are linked.
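
The join can be illustrated with toy data. The sketch below uses an in-memory SQLite database as a stand-in for MySQL, with only the columns that the query touches; the page_id values and the rows themselves are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE redirect (rd_from INTEGER, rd_title TEXT);
""")

# Toy rows: page 3 is the real article, pages 1 and 2 redirect to it.
conn.executemany("INSERT INTO page VALUES (?, ?)", [
    (1, "COVID-19"),
    (2, "Covid_19"),
    (3, "New Coronavirus Infection_(2019)"),
])
conn.executemany("INSERT INTO redirect VALUES (?, ?)", [
    (1, "New Coronavirus Infection_(2019)"),
    (2, "New Coronavirus Infection_(2019)"),
])

# rd_from holds the source page_id, so joining on it recovers the source title.
pairs = conn.execute(
    "SELECT page.page_title, redirect.rd_title "
    "FROM page JOIN redirect ON redirect.rd_from = page.page_id "
    "ORDER BY page.page_id"
).fetchall()
conn.close()
print(pairs)
```

Page 3 does not appear in the result because it has no row in the redirect table.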

[PYTHON] Extract redirects from Wikipedia dumps