[PYTHON] Extract redirects from Wikipedia dumps

This is a personal memo. I want to keep it as concise as possible so that the output file can be produced quickly.

Goal

Collect Wikipedia redirects and produce a file like the one below, with one JSON object per line.

 {"src": "COVID-19", "dst": "New Coronavirus Infection_ (2019)"}
 {"src": "COVID-2019", "dst": "Coronavirus disease _ (2019)"}
 {"src": "Covid-19", "dst": "New Coronavirus Infection_ (2019)"}
 {"src": "Covid-2019", "dst": "Coronavirus disease _ (2019)"}
 {"src": "New Coronavirus Infection", "dst": "New Coronavirus Infection_ (2019)"}
 {"src": "Covid 19", "dst": "New Coronavirus Infection_ (2019)"}
 {"src": "COVID19", "dst": "Coronavirus disease _ (2019)"}
 {"src": "2019 New Coronavirus Infection", "dst": "New Coronavirus Infection_ (2019)"}

What is a redirect?

For details, see [Wikipedia: Redirect](https://ja.wikipedia.org/wiki/Wikipedia: Redirect).

Redirect example

For example, when you access https://ja.wikipedia.org/wiki/COVID-19, you are automatically redirected to https://ja.wikipedia.org/wiki/New Coronavirus Infection_ (2019).
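This post extracts all redirects in bulk from a dump, but a single redirect can also be resolved through the MediaWiki API. A minimal sketch, assuming the requests package is installed:

```python
import requests

# Ask the Japanese Wikipedia API to resolve redirects for a single title.
# The "redirects" parameter makes the API report from/to pairs.
resp = requests.get(
    "https://ja.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "COVID-19",
        "redirects": 1,
        "format": "json",
    },
)
for r in resp.json().get("query", {}).get("redirects", []):
    print(r["from"], "->", r["to"])
```

This is handy for spot checks, but for extracting all redirects at once the dump-based approach below is far faster.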

Implementation etc.

0. What you need

A MySQL server, Python, and the MySQLdb package are used below.

1. Restore the Wikipedia dump

Download required data

Download the necessary files from the following location.

https://dumps.wikimedia.org/jawiki/

- jawiki-[dump acquisition date]-redirect.sql.gz
- jawiki-[dump acquisition date]-page.sql.gz
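If you prefer to script the download, something like the following should work. This is a sketch that assumes the latest/ directory of the dump site; substitute a dated directory if you need a specific dump.

```python
import urllib.request

# Download the two dump files used below.
# "latest" points at the most recent dump; replace it with a dated
# directory if you want a specific snapshot.
BASE = "https://dumps.wikimedia.org/jawiki/latest/"
for name in ["jawiki-latest-page.sql.gz", "jawiki-latest-redirect.sql.gz"]:
    urllib.request.urlretrieve(BASE + name, name)
    print("downloaded", name)
```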

Decompress

 $ gunzip jawiki-[dump acquisition date]-redirect.sql.gz
 $ gunzip jawiki-[dump acquisition date]-page.sql.gz

Restore to MySQL database

 $ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-page.sql
 $ mysql -u [user name] -p [DB name] < jawiki-[dump acquisition date]-redirect.sql
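To confirm that the restore worked, it is enough to count the rows in the two tables. A minimal sketch, using the same placeholder credentials as the script below:

```python
import MySQLdb

# Quick sanity check: both tables should contain a large number of rows.
conn = MySQLdb.connect(user="[MySQL user name]", passwd="[password]",
                       host="localhost", db="[DB name]")
cur = conn.cursor()
for table in ["page", "redirect"]:
    cur.execute("SELECT COUNT(*) FROM " + table)
    print(table, cur.fetchone()[0])
cur.close()
conn.close()
```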

2. Redirect extraction

Python code

The following code queries the database, extracts the redirects, and saves them as JSON Lines.

import json

import MySQLdb
import MySQLdb.cursors

USERNAME = "[MySQL user name]"
PASSWORD = "[password]"
DB_NAME = "[DB name]"
OUTPUT = "./redirects.json"


def save_jsonl(file_path, data):
    # Save an iterable of dicts as JSON Lines (one JSON object per line)
    dumps = (json.dumps(d, ensure_ascii=False) for d in data)
    with open(file_path, "w") as f:
        f.write("\n".join(dumps))


if __name__ == '__main__':
    # Connect to the database
    conn = MySQLdb.connect(
        user=USERNAME,
        passwd=PASSWORD,
        host='localhost',
        db=DB_NAME
    )

    # Create a cursor and execute the query
    cur = conn.cursor(MySQLdb.cursors.DictCursor)
    sql = "select page.page_title, redirect.rd_title from page, redirect where redirect.rd_from=page.page_id"
    cur.execute(sql)
    rows = cur.fetchall()

    # Organize the results (titles come back as bytes, so decode them)
    redirects = []
    for row in rows:
        row = {key: cell.decode() if isinstance(cell, bytes) else cell
               for key, cell in row.items()}
        redirects.append({
            "src": row["page_title"],
            "dst": row["rd_title"]
        })

    # Save
    save_jsonl(OUTPUT, redirects)

    cur.close()
    conn.close()

Run

 $ python extract_redirects.py

That's all!

α. Brief explanation

In jawiki-[dump acquisition date]-redirect.sql.gz, each record links a redirect source's page_id to the redirect destination's title. In jawiki-[dump acquisition date]-page.sql.gz, each record links a page_id to its page title.

By joining these two tables, the redirect source title can be matched with the redirect destination title.
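As a side note, both tables also carry namespace columns (page_namespace and rd_namespace in the standard MediaWiki schema), so the query above returns redirects from every namespace (Template:, Wikipedia:, and so on). If you only want article redirects, a filter along the following lines should work; treat it as a sketch and check it against the actual schema of your dump.

```python
# Variant of the query used above, restricted to the article namespace (0).
# page_namespace / rd_namespace are part of the standard MediaWiki schema.
sql = (
    "SELECT page.page_title, redirect.rd_title "
    "FROM page JOIN redirect ON redirect.rd_from = page.page_id "
    "WHERE page.page_namespace = 0 AND redirect.rd_namespace = 0"
)
```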
