[PYTHON] Write a script to convert a MySQL dump to TSV

Attempts to convert the SQL dumped from MySQL to tab-delimited text (so-called TSV) for other purposes.

Overview

MySQL tables can be dumped as SQL with mysqldump. The main parts are the CREATE and INSERT statements.

(Omitted) CREATE TABLE t_price ( (Omitted) INSERT INTO t_price VALUES (1,'fixed form',25,82),(Omitted)(10,'fixed form外',4000,1180); (Omitted)

You can dump directly to TSV if it is a dump from MySQL that you manage.

If the only data published elsewhere is a SQL dump, you can either grab it and dump it or convert it. This time I will write a script and try the conversion.

specification

Extracts only the VALUES of the INSERT statement from SQL and converts it to TSV. Specifically, we aim for the following behavior.

Before conversion


INSERT INTO `table` VALUES (1,2,3),(4,5,6);
INSERT INTO `table` VALUES (1,'a,b','c\'d');

After conversion


1	2	3
4	5	6
1	a,b	c'd

Python

Write a conversion script in Python.

read_string

Parses the string enclosed in single quotes '. This is the busiest.

Checks if the first character is '. Escape sequence processing is simplified by removing \ for \" , \', \\, but leaving \ for other control characters.

test is a test flag.

sql2tsv.py


test = True

def read_string(src, pos):
    length = len(src)
    if pos >= length or src[pos] != "'": return None
    pos += 1
    ret = ""
    while pos < length and src[pos] != "'":
        if src[pos] == "\\":
            pos += 1
            end = pos >= length
            if end or not src[pos] in "\"'\\":
                ret += "\\"
            if end: break
        ret += src[pos]
        pos += 1
    if pos < length and src[pos] == "'": pos += 1
    return (ret, pos)

if test:
    src = r"'a,b','c\'d'"
    pos = 0
    while pos < len(src):
        s, p = read_string(src, pos)
        print("read_string", (src, pos), "->", (s, p))
        pos = p + 1

Execution result


read_string ("'a,b','c\\'d'", 0) -> ('a,b', 5)
read_string ("'a,b','c\\'d'", 6) -> ("c'd", 12)

read_value

Read one piece of data. For strings, read read_string, otherwise read to, or (.

sql2tsv.py (continued)


def read_value(src, pos):
    length = len(src)
    if pos >= length: return None
    sp = read_string(src, pos)
    if sp: return sp
    p = pos
    while p < length and src[p] != "," and src[p] != ")":
        p += 1
    return (src[pos:p], p)

if test:
    for src in ["1,2,3", r"1,'a,b','c\'d'"]:
        pos = 0
        while (value := read_value(src, pos)):
            s, p = value
            print("read_value", (src, pos), "->", (s, p))
            pos = p + 1

Execution result


read_value ('1,2,3', 0) -> ('1', 1)
read_value ('1,2,3', 2) -> ('2', 3)
read_value ('1,2,3', 4) -> ('3', 5)
read_value ("1,'a,b','c\\'d'", 0) -> ('1', 1)
read_value ("1,'a,b','c\\'d'", 2) -> ('a,b', 7)
read_value ("1,'a,b','c\\'d'", 8) -> ("c'd", 14)

read_values

Reads all comma-separated data in parentheses.

sql2tsv.py (continued)


def read_values(src, pos):
    length = len(src)
    if pos >= length or src[pos] != "(": return None
    pos += 1
    ret = []
    if pos < length and src[pos] != ")":
        while (value := read_value(src, pos)):
            s, pos = value
            ret.append(s)
            if pos >= length or src[pos] != ",": break
            pos += 1
    if pos < length and src[pos] == ")": pos += 1
    return (ret, pos)

if test:
    for src in [r"(1,2,3)", r"(1,'a,b','c\'d')"]:
        print("read_values", (src, 0), "->", read_values(src, 0))

Execution result


read_values ('(1,2,3)', 0) -> (['1', '2', '3'], 7)
read_values ("(1,'a,b','c\\'d')", 0) -> (['1', 'a,b', "c'd"], 16)

read_all_values

Reads all data enclosed in parentheses. I will make it a generator assuming handling in a loop.

sql2tsv.py (continued)


def read_all_values(src, pos):
    length = len(src)
    while (sp := read_values(src, pos)):
        s, pos = sp
        yield s
        if pos >= length or src[pos] != ",": break
        pos += 1

if test:
    src = r"(1,2,3),(1,'a,b','c\'d')"
    print("read_all_values", (src, 0), "->", list(read_all_values(src, 0)))

Execution result


read_all_values ("(1,2,3),(1,'a,b','c\\'d')", 0) -> [['1', '2', '3'], ['1', 'a,b', "c'd"]]

read_sql

Find the line starting with ʻINSERT INTO from each line in SQL and process it with read_all_values`.

sql2tsv.py (continued)


def read_sql(stream):
    while (line := stream.readline()):
        if line.startswith("INSERT INTO "):
            p = line.find("VALUES (")
            if p >= 0: yield from read_all_values(line, p + 7)

if test:
    import io
    src = r"""
INSERT INTO `table` VALUES (1,2,3),(4,5,6);
INSERT INTO `table` VALUES (1,'a,b','c\'d');
""".strip()
    print("read_sql", (src,))
    print("->", list(read_sql(io.StringIO(src))))

Execution result


read_sql ("INSERT INTO `table` VALUES (1,2,3),(4,5,6);\nINSERT INTO `table` VALUES (1,'a,b','c\\'d');",)
-> [['1', '2', '3'], ['4', '5', '6'], ['1', 'a,b', "c'd"]]

Command

I have implemented all the necessary functions. Turn off the test.

sql2tsv.py (change)


test = False

Reads the specified file and writes it to the specified file.

Assuming that the UTF-8 byte string is cut off in the middle of a character, ʻopen specifies ʻerrors =" replace ". This will replace the anomalous characters with. If ʻerrors` is not specified, an error will occur.

[Reference] (Windows) Causes and workarounds of UnicodeEncodeError in Python3 --Qiita

sql2tsv.py (continued)


if __name__ == "__main__":
    import sys
    try:
        sql, tsv = sys.argv[-2], sys.argv[-1]
        if not (sql.endswith(".sql") and tsv.endswith(".tsv")):
            raise Exception
    except:
        print("usage: %s sql tsv" % sys.argv[0])
        exit(1)
    with open(sql, "r", encoding="utf-8", errors="replace") as fr:
        with open(tsv, "w", encoding="utf-8") as fw:
            for values in read_sql(fr):
                fw.write("\t".join(values))
                fw.write("\n")

Use it as follows.

python sql2tsv.py input.sql output.tsv

I put the whole script below.

measurement

It measures how fast it can be processed with a huge file.

Wiktionary Japanese version

As with Wikipedia, dump data is open to the public.

The following files will be used from the May 1, 2020 version available at the time of writing the article.

Extracting the file will be about 90 MB.

gunzip -k jawiktionary-20200501-categorylinks.sql.gz

When the expanded SQL is converted by this script, a TSV of about 79 MB is output.

$ time python sql2tsv.py jawiktionary-20200501-categorylinks.sql jawiktionary-20200501-categorylinks.tsv

real    0m19.921s
user    0m19.375s
sys     0m0.516s

The number of lines is about 950,000.

$ wc -l jawiktionary-20200501-categorylinks.tsv
951077 jawiktionary-20200501-categorylinks.tsv

Wiktionary English version

Then try the English version.

Use the following files. When unfolded, it swells to about 3 GB.

Convert this. The output file is about 2.6 GB.

$ time python sql2tsv.py enwiktionary-20200501-categorylinks.sql enwiktionary-20200501-categorylinks.tsv

real    15m58.965s
user    15m39.063s
sys     0m12.578s

The number of lines is as high as 28 million.

$ wc -l enwiktionary-20200501-categorylinks.tsv
28021874 enwiktionary-20200501-categorylinks.tsv

F#

I ported it to F # as a trial.

Processing the same file was a lot faster.

File Time
jawiktionary-20200501-categorylinks.sql 0m03.168s
enwiktionary-20200501-categorylinks.sql 1m52.396s

Read compressed file

The .NET Framework has a GZipStream that allows you to directly read a gz file that remains compressed.

sql2tsv.fsx (change)


open System.IO.Compression
let args = Environment.GetCommandLineArgs()
let sqlgz, tsv =
    if args.Length < 2 then ("", "") else
    let len = args.Length
    (args.[len - 2], args.[len - 1])
if not (sqlgz.EndsWith ".sql.gz") || Path.GetExtension tsv <> ".tsv" then
    printfn "usage: sql2tsv sql.gz tsv"
    exit 1
do
    use fs = new FileStream(sqlgz, FileMode.Open)
    use gs = new GZipStream(fs, CompressionMode.Decompress)
    use sr = new StreamReader(gs)
    use sw = new StreamWriter(tsv)
    sw.NewLine <- "\n"
    for values in readSql sr do
        sw.WriteLine(String.concat "\t" values)

The measurement results are summarized below.

Summary

Arrange the measurement results.

language File time
Python jawiktionary-20200501-categorylinks.sql 0m19.921s
F# jawiktionary-20200501-categorylinks.sql 0m03.168s
F# jawiktionary-20200501-categorylinks.sql.gz 0m03.509s
Python enwiktionary-20200501-categorylinks.sql 15m58.965s
F# enwiktionary-20200501-categorylinks.sql 1m52.396s
F# enwiktionary-20200501-categorylinks.sql.gz 2m15.380s

Processing the compressed file directly is a bit slower, but it saves disk space, which is an advantage.

Recommended Posts

Write a script to convert a MySQL dump to TSV
Creating a shell script to write a diary
Try to write a ping confirmation script appropriately
I'll never forget how to write a shell script, don't forget! !!
Script to mysqldump to all MySQL DBs
How to write a Python class
Write standard output to a file
Convert a string to an image
A tool to convert Juniper config
Write a batch script with Python3.5 ~
Write a script to calculate the distance with Elasticsearch 5 system painless
Qiita (1) How to write a code name
Execute a script from Jupyter to process
[Python] How to convert a 2D list to a 1D list
How to convert csv to tsv in CLI
[Ubuntu] How to execute a shell script
A simple IDAPython script to name a function
[To Twitter gentlemen] I wrote a script to convert .jpg-large to .jpg at once.
Dump SQLite3 data and migrate to MySQL
I made a script in python to convert .md files to Scrapbox format
I made a script to display emoji
Script to create a Mac dictionary file
How to run a Maya Python script
Convert the cURL API to a Python script (using IBM Cloud object storage)
Python script to convert latitude / longitude to mesh code
Convert a multidimensional list (array) to one dimension
Why does Python have to write a colon?
How to write a ShellScript Bash for statement
I wrote a script to upload a WordPress plugin
How to write a named tuple document in 2020
[Go] How to write or call a function
I want to write to a file with Python
How to write a ShellScript bash case statement
Connect to mysql
How to convert / restore a string with [] in python
How to write a GUI using the maya command
Convert a slice object to a list of index numbers
Convert a text file with hexadecimal values to a binary file
I want to write in Python! (2) Let's write a test
How to convert a class object to a dictionary with SQLAlchemy
I made a script to put a snippet in README.md
How to write a list / dictionary type of Python3
A memorandum to run a python script in a bat file
I made a code to convert illustration2vec to keras model
How to convert a mel spectrogram back to a wav file
[Python] A memo to write CSV vertically with Pandas
A program to write Lattice Hinge with Rhinoceros with Python
[Python] Created a method to convert radix in 1 second
How to create a simple TCP server / client script
Write code to Unit Test a Python web app
Python script to create a JSON file from a CSV file
Output a binary dump in binary and revert to a binary file
Generate a bash script to add Datadog monitor settings
I made a script in Python to convert a text file for JSON (for vscode user snippet)