This article converts SQL dumped from MySQL into tab-delimited text (TSV) for use elsewhere.
MySQL tables can be dumped as SQL with mysqldump. The dump consists mainly of CREATE and INSERT statements.
(Omitted)
CREATE TABLE t_price
(
(Omitted)
INSERT INTO t_price
VALUES (1,'standard',25,82),(Omitted)(10,'non-standard',4000,1180);
(Omitted)
If the MySQL server is one you manage yourself, you can dump directly to TSV.
If the only data published elsewhere is an SQL dump, you can either load it into MySQL and re-dump it, or convert the file directly. This time I will write a script and try the conversion.
It extracts only the VALUES of each INSERT statement from the SQL and converts them to TSV. Specifically, we aim for the following behavior.
Before conversion
INSERT INTO `table` VALUES (1,2,3),(4,5,6);
INSERT INTO `table` VALUES (1,'a,b','c\'d');
After conversion
1 2 3
4 5 6
1 a,b c'd
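For reference, when you administer the server yourself, mysqldump can produce tab-delimited files directly with the `--tab` option; a sketch (the directory, database name, and table name below are hypothetical):

```shell
# --tab writes, for each table, a .sql file (schema) and a tab-delimited
# .txt file (data) into the given directory. The directory must be writable
# by the MySQL server process itself (the secure_file_priv setting applies).
mysqldump --tab=/tmp/dump mydb t_price
```
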
Python
Write a conversion script in Python.
read_string
Parses a string enclosed in single quotes (`'`). This is the busiest function.
It checks that the first character is `'`. Escape-sequence handling is simplified: the backslash is removed for `\"`, `\'`, and `\\`, but kept as-is for other escapes such as control characters.
`test` is a test flag.
sql2tsv.py
test = True

def read_string(src, pos):
    length = len(src)
    if pos >= length or src[pos] != "'": return None
    pos += 1
    ret = ""
    while pos < length and src[pos] != "'":
        if src[pos] == "\\":
            pos += 1
            end = pos >= length
            if end or not src[pos] in "\"'\\":
                ret += "\\"
            if end: break
        ret += src[pos]
        pos += 1
    if pos < length and src[pos] == "'": pos += 1
    return (ret, pos)
if test:
    src = r"'a,b','c\'d'"
    pos = 0
    while pos < len(src):
        s, p = read_string(src, pos)
        print("read_string", (src, pos), "->", (s, p))
        pos = p + 1
Execution result
read_string ("'a,b','c\\'d'", 0) -> ('a,b', 5)
read_string ("'a,b','c\\'d'", 6) -> ("c'd", 12)
read_value
Reads one piece of data. A string is read with `read_string`; anything else is read up to the next `,` or `)`.
sql2tsv.py (continued)
def read_value(src, pos):
    length = len(src)
    if pos >= length: return None
    sp = read_string(src, pos)
    if sp: return sp
    p = pos
    while p < length and src[p] != "," and src[p] != ")":
        p += 1
    return (src[pos:p], p)
if test:
    for src in ["1,2,3", r"1,'a,b','c\'d'"]:
        pos = 0
        while (value := read_value(src, pos)):
            s, p = value
            print("read_value", (src, pos), "->", (s, p))
            pos = p + 1
Execution result
read_value ('1,2,3', 0) -> ('1', 1)
read_value ('1,2,3', 2) -> ('2', 3)
read_value ('1,2,3', 4) -> ('3', 5)
read_value ("1,'a,b','c\\'d'", 0) -> ('1', 1)
read_value ("1,'a,b','c\\'d'", 2) -> ('a,b', 7)
read_value ("1,'a,b','c\\'d'", 8) -> ("c'd", 14)
read_values
Reads all comma-separated data in parentheses.
sql2tsv.py (continued)
def read_values(src, pos):
    length = len(src)
    if pos >= length or src[pos] != "(": return None
    pos += 1
    ret = []
    if pos < length and src[pos] != ")":
        while (value := read_value(src, pos)):
            s, pos = value
            ret.append(s)
            if pos >= length or src[pos] != ",": break
            pos += 1
    if pos < length and src[pos] == ")": pos += 1
    return (ret, pos)
if test:
    for src in [r"(1,2,3)", r"(1,'a,b','c\'d')"]:
        print("read_values", (src, 0), "->", read_values(src, 0))
Execution result
read_values ('(1,2,3)', 0) -> (['1', '2', '3'], 7)
read_values ("(1,'a,b','c\\'d')", 0) -> (['1', 'a,b', "c'd"], 16)
read_all_values
Reads all the parenthesized groups of data. It is written as a generator, assuming it will be consumed in a loop.
sql2tsv.py (continued)
def read_all_values(src, pos):
    length = len(src)
    while (sp := read_values(src, pos)):
        s, pos = sp
        yield s
        if pos >= length or src[pos] != ",": break
        pos += 1
if test:
    src = r"(1,2,3),(1,'a,b','c\'d')"
    print("read_all_values", (src, 0), "->", list(read_all_values(src, 0)))
Execution result
read_all_values ("(1,2,3),(1,'a,b','c\\'d')", 0) -> [['1', '2', '3'], ['1', 'a,b', "c'd"]]
read_sql
Finds the lines starting with `INSERT INTO` in the SQL and processes them with `read_all_values`.
sql2tsv.py (continued)
def read_sql(stream):
    while (line := stream.readline()):
        if line.startswith("INSERT INTO "):
            p = line.find("VALUES (")
            if p >= 0: yield from read_all_values(line, p + 7)
if test:
    import io
    src = r"""
INSERT INTO `table` VALUES (1,2,3),(4,5,6);
INSERT INTO `table` VALUES (1,'a,b','c\'d');
""".strip()
    print("read_sql", (src,))
    print("->", list(read_sql(io.StringIO(src))))
Execution result
read_sql ("INSERT INTO `table` VALUES (1,2,3),(4,5,6);\nINSERT INTO `table` VALUES (1,'a,b','c\\'d');",)
-> [['1', '2', '3'], ['4', '5', '6'], ['1', 'a,b', "c'd"]]
All the necessary functions are now implemented. Turn off the test flag.
sql2tsv.py (change)
test = False
The main routine reads the specified input file and writes the result to the specified output file.
In case the UTF-8 byte stream is cut off in the middle of a character, `open` is called with `errors="replace"`, which substitutes the replacement character (U+FFFD) for any malformed bytes. If `errors` is not specified, such bytes raise an error.
[Reference] (Windows) Causes of and workarounds for UnicodeEncodeError in Python 3 - Qiita
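The effect of `errors="replace"` can be seen in a minimal sketch, independent of the script:

```python
# A UTF-8 byte string cut off in the middle of a multi-byte character.
broken = b"abc\xe3\x81"

# With errors="replace", the malformed tail becomes the replacement
# character (U+FFFD) instead of raising an exception.
print(broken.decode("utf-8", errors="replace"))  # "abc" plus U+FFFD

# Strict decoding of the same bytes raises UnicodeDecodeError.
try:
    broken.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decoding fails:", e.reason)
```
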
sql2tsv.py (continued)
if __name__ == "__main__":
    import sys
    try:
        sql, tsv = sys.argv[-2], sys.argv[-1]
        if not (sql.endswith(".sql") and tsv.endswith(".tsv")):
            raise Exception
    except:
        print("usage: %s sql tsv" % sys.argv[0])
        exit(1)
    with open(sql, "r", encoding="utf-8", errors="replace") as fr:
        with open(tsv, "w", encoding="utf-8") as fw:
            for values in read_sql(fr):
                fw.write("\t".join(values))
                fw.write("\n")
Use it as follows.
python sql2tsv.py input.sql output.tsv
I put the whole script below.
Let's measure how fast a huge file can be processed.
As with Wikipedia, Wiktionary dump data is publicly available.
The following file from the May 1, 2020 version, the latest available at the time of writing, will be used.
Extracted, it is about 90 MB.
gunzip -k jawiktionary-20200501-categorylinks.sql.gz
-k is the option to keep the compressed file after extraction. Converting the extracted SQL with this script outputs a TSV of about 79 MB.
$ time python sql2tsv.py jawiktionary-20200501-categorylinks.sql jawiktionary-20200501-categorylinks.tsv
real 0m19.921s
user 0m19.375s
sys 0m0.516s
The number of lines is about 950,000.
$ wc -l jawiktionary-20200501-categorylinks.tsv
951077 jawiktionary-20200501-categorylinks.tsv
Next, try the English version.
The following file is used. Extracted, it swells to about 3 GB.
Converting it produces an output file of about 2.6 GB.
$ time python sql2tsv.py enwiktionary-20200501-categorylinks.sql enwiktionary-20200501-categorylinks.tsv
real 15m58.965s
user 15m39.063s
sys 0m12.578s
The number of lines reaches 28 million.
$ wc -l enwiktionary-20200501-categorylinks.tsv
28021874 enwiktionary-20200501-categorylinks.tsv
F#
As an experiment, I ported the script to F#.
Processing the same files was much faster.
File | Time |
---|---|
jawiktionary-20200501-categorylinks.sql | 0m03.168s |
enwiktionary-20200501-categorylinks.sql | 1m52.396s |
The .NET Framework provides GZipStream, which lets you read a gz file directly while it remains compressed.
sql2tsv.fsx (change)
open System
open System.IO
open System.IO.Compression

let args = Environment.GetCommandLineArgs()
let sqlgz, tsv =
    if args.Length < 2 then ("", "") else
    let len = args.Length
    (args.[len - 2], args.[len - 1])
if not (sqlgz.EndsWith ".sql.gz") || Path.GetExtension tsv <> ".tsv" then
    printfn "usage: sql2tsv sql.gz tsv"
    exit 1
do
    use fs = new FileStream(sqlgz, FileMode.Open)
    use gs = new GZipStream(fs, CompressionMode.Decompress)
    use sr = new StreamReader(gs)
    use sw = new StreamWriter(tsv)
    sw.NewLine <- "\n"
    for values in readSql sr do
        sw.WriteLine(String.concat "\t" values)
The measurement results are summarized below.
Language | File | Time |
---|---|---|
Python | jawiktionary-20200501-categorylinks.sql | 0m19.921s |
F# | jawiktionary-20200501-categorylinks.sql | 0m03.168s |
F# | jawiktionary-20200501-categorylinks.sql.gz | 0m03.509s |
Python | enwiktionary-20200501-categorylinks.sql | 15m58.965s |
F# | enwiktionary-20200501-categorylinks.sql | 1m52.396s |
F# | enwiktionary-20200501-categorylinks.sql.gz | 2m15.380s |
Processing the compressed file directly is a bit slower, but it saves disk space, which is an advantage.
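Python can also stream a compressed file directly with the standard gzip module; a minimal sketch (the file name sample.sql.gz is hypothetical) showing that gzip.open in text mode yields a file-like object that read_sql could consume without extracting first:

```python
import gzip

# Create a small compressed SQL file for demonstration.
with gzip.open("sample.sql.gz", "wt", encoding="utf-8") as f:
    f.write("INSERT INTO `t` VALUES (1,2,3);\n")

# Read it back while still compressed, just like GZipStream in F#.
# errors="replace" handles malformed UTF-8 as in the main script.
with gzip.open("sample.sql.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        print(line.rstrip())  # INSERT INTO `t` VALUES (1,2,3);
```
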