This article converts SQL dumped from MySQL into tab-delimited text (TSV) for use elsewhere.
MySQL tables can be dumped as SQL with mysqldump. The dump consists mainly of CREATE and INSERT statements.
(Omitted)
CREATE TABLE t_price
(
(Omitted)
INSERT INTO t_price
VALUES (1,'standard',25,82),(Omitted)(10,'non-standard',4000,1180);
(Omitted)
If the MySQL server is one you manage yourself, you can dump directly to TSV.
If the only data published elsewhere is an SQL dump, you can either load it into MySQL and re-dump it, or convert the file directly. This time I will write a script and try the conversion.
It extracts only the VALUES of each INSERT statement from the SQL and converts them to TSV. Specifically, we aim for the following behavior.
Before conversion
INSERT INTO `table` VALUES (1,2,3),(4,5,6);
INSERT INTO `table` VALUES (1,'a,b','c\'d');
After conversion
1 2 3
4 5 6
1 a,b c'd
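For reference, when you administer the server yourself, mysqldump can produce tab-delimited files directly with the `--tab` option; a sketch (the directory, database name, and table name below are hypothetical):

```shell
# --tab writes, for each table, a .sql file (schema) and a tab-delimited
# .txt file (data) into the given directory. The directory must be writable
# by the MySQL server process itself (the secure_file_priv setting applies).
mysqldump --tab=/tmp/dump mydb t_price
```
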
Python
Write a conversion script in Python.
read_string
Parses a string enclosed in single quotes (`'`). This is the busiest function.
It checks that the first character is `'`. Escape-sequence handling is simplified: the backslash is removed for `\"`, `\'`, and `\\`, but kept as-is for other escapes such as control characters.
`test` is a test flag.
sql2tsv.py
test = True

def read_string(src, pos):
    length = len(src)
    if pos >= length or src[pos] != "'": return None
    pos += 1
    ret = ""
    while pos < length and src[pos] != "'":
        if src[pos] == "\\":
            pos += 1
            end = pos >= length
            if end or not src[pos] in "\"'\\":
                ret += "\\"
            if end: break
        ret += src[pos]
        pos += 1
    if pos < length and src[pos] == "'": pos += 1
    return (ret, pos)
if test:
    src = r"'a,b','c\'d'"
    pos = 0
    while pos < len(src):
        s, p = read_string(src, pos)
        print("read_string", (src, pos), "->", (s, p))
        pos = p + 1
Execution result
read_string ("'a,b','c\\'d'", 0) -> ('a,b', 5)
read_string ("'a,b','c\\'d'", 6) -> ("c'd", 12)
read_value
Reads one piece of data. A string is read with `read_string`; anything else is read up to the next `,` or `)`.
sql2tsv.py (continued)
def read_value(src, pos):
    length = len(src)
    if pos >= length: return None
    sp = read_string(src, pos)
    if sp: return sp
    p = pos
    while p < length and src[p] != "," and src[p] != ")":
        p += 1
    return (src[pos:p], p)
if test:
    for src in ["1,2,3", r"1,'a,b','c\'d'"]:
        pos = 0
        while (value := read_value(src, pos)):
            s, p = value
            print("read_value", (src, pos), "->", (s, p))
            pos = p + 1
Execution result
read_value ('1,2,3', 0) -> ('1', 1)
read_value ('1,2,3', 2) -> ('2', 3)
read_value ('1,2,3', 4) -> ('3', 5)
read_value ("1,'a,b','c\\'d'", 0) -> ('1', 1)
read_value ("1,'a,b','c\\'d'", 2) -> ('a,b', 7)
read_value ("1,'a,b','c\\'d'", 8) -> ("c'd", 14)
read_values
Reads all comma-separated data in parentheses.
sql2tsv.py (continued)
def read_values(src, pos):
    length = len(src)
    if pos >= length or src[pos] != "(": return None
    pos += 1
    ret = []
    if pos < length and src[pos] != ")":
        while (value := read_value(src, pos)):
            s, pos = value
            ret.append(s)
            if pos >= length or src[pos] != ",": break
            pos += 1
    if pos < length and src[pos] == ")": pos += 1
    return (ret, pos)
if test:
    for src in [r"(1,2,3)", r"(1,'a,b','c\'d')"]:
        print("read_values", (src, 0), "->", read_values(src, 0))
Execution result
read_values ('(1,2,3)', 0) -> (['1', '2', '3'], 7)
read_values ("(1,'a,b','c\\'d')", 0) -> (['1', 'a,b', "c'd"], 16)
read_all_values
Reads all the parenthesized groups of data. It is written as a generator, assuming it will be consumed in a loop.
sql2tsv.py (continued)
def read_all_values(src, pos):
    length = len(src)
    while (sp := read_values(src, pos)):
        s, pos = sp
        yield s
        if pos >= length or src[pos] != ",": break
        pos += 1
if test:
    src = r"(1,2,3),(1,'a,b','c\'d')"
    print("read_all_values", (src, 0), "->", list(read_all_values(src, 0)))
Execution result
read_all_values ("(1,2,3),(1,'a,b','c\\'d')", 0) -> [['1', '2', '3'], ['1', 'a,b', "c'd"]]
read_sql
Finds the lines starting with `INSERT INTO` in the SQL and processes them with `read_all_values`.
sql2tsv.py (continued)
def read_sql(stream):
    while (line := stream.readline()):
        if line.startswith("INSERT INTO "):
            p = line.find("VALUES (")
            if p >= 0: yield from read_all_values(line, p + 7)
if test:
    import io
    src = r"""
INSERT INTO `table` VALUES (1,2,3),(4,5,6);
INSERT INTO `table` VALUES (1,'a,b','c\'d');
""".strip()
    print("read_sql", (src,))
    print("->", list(read_sql(io.StringIO(src))))
Execution result
read_sql ("INSERT INTO `table` VALUES (1,2,3),(4,5,6);\nINSERT INTO `table` VALUES (1,'a,b','c\\'d');",)
-> [['1', '2', '3'], ['4', '5', '6'], ['1', 'a,b', "c'd"]]
All the necessary functions are now implemented. Turn off the test flag.
sql2tsv.py (change)
test = False
The main routine reads the specified input file and writes the result to the specified output file.
In case the UTF-8 byte stream is cut off in the middle of a character, `open` is called with `errors="replace"`, which substitutes the replacement character (U+FFFD) for any malformed bytes. If `errors` is not specified, such bytes raise an error.
[Reference] (Windows) Causes of and workarounds for UnicodeEncodeError in Python 3 - Qiita
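The effect of `errors="replace"` can be seen in a minimal sketch, independent of the script:

```python
# A UTF-8 byte string cut off in the middle of a multi-byte character.
broken = b"abc\xe3\x81"

# With errors="replace", the malformed tail becomes the replacement
# character (U+FFFD) instead of raising an exception.
print(broken.decode("utf-8", errors="replace"))  # "abc" plus U+FFFD

# Strict decoding of the same bytes raises UnicodeDecodeError.
try:
    broken.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decoding fails:", e.reason)
```
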
sql2tsv.py (continued)
if __name__ == "__main__":
    import sys
    try:
        sql, tsv = sys.argv[-2], sys.argv[-1]
        if not (sql.endswith(".sql") and tsv.endswith(".tsv")):
            raise Exception
    except:
        print("usage: %s sql tsv" % sys.argv[0])
        exit(1)
    with open(sql, "r", encoding="utf-8", errors="replace") as fr:
        with open(tsv, "w", encoding="utf-8") as fw:
            for values in read_sql(fr):
                fw.write("\t".join(values))
                fw.write("\n")
Use it as follows.
python sql2tsv.py input.sql output.tsv
I put the whole script below.
Let's measure how fast a huge file can be processed.
As with Wikipedia, Wiktionary dump data is publicly available.
The following file from the May 1, 2020 version, the latest available at the time of writing, will be used.
Extracted, it is about 90 MB.
gunzip -k jawiktionary-20200501-categorylinks.sql.gz
-k is the option to keep the compressed file after extraction. Converting the extracted SQL with this script outputs a TSV of about 79 MB.
$ time python sql2tsv.py jawiktionary-20200501-categorylinks.sql jawiktionary-20200501-categorylinks.tsv
real 0m19.921s
user 0m19.375s
sys 0m0.516s
The number of lines is about 950,000.
$ wc -l jawiktionary-20200501-categorylinks.tsv
951077 jawiktionary-20200501-categorylinks.tsv
Next, try the English version.
The following file is used. Extracted, it swells to about 3 GB.
Converting it produces an output file of about 2.6 GB.
$ time python sql2tsv.py enwiktionary-20200501-categorylinks.sql enwiktionary-20200501-categorylinks.tsv
real 15m58.965s
user 15m39.063s
sys 0m12.578s
The number of lines reaches 28 million.
$ wc -l enwiktionary-20200501-categorylinks.tsv
28021874 enwiktionary-20200501-categorylinks.tsv
F#
As an experiment, I ported the script to F#.
Processing the same files was much faster.
File | Time |
---|---|
jawiktionary-20200501-categorylinks.sql | 0m03.168s |
enwiktionary-20200501-categorylinks.sql | 1m52.396s |
The .NET Framework provides GZipStream, which lets you read a gz file directly while it remains compressed.
sql2tsv.fsx (change)
open System
open System.IO
open System.IO.Compression

let args = Environment.GetCommandLineArgs()
let sqlgz, tsv =
    if args.Length < 2 then ("", "") else
    let len = args.Length
    (args.[len - 2], args.[len - 1])
if not (sqlgz.EndsWith ".sql.gz") || Path.GetExtension tsv <> ".tsv" then
    printfn "usage: sql2tsv sql.gz tsv"
    exit 1
do
    use fs = new FileStream(sqlgz, FileMode.Open)
    use gs = new GZipStream(fs, CompressionMode.Decompress)
    use sr = new StreamReader(gs)
    use sw = new StreamWriter(tsv)
    sw.NewLine <- "\n"
    for values in readSql sr do
        sw.WriteLine(String.concat "\t" values)
The measurement results are summarized below.
Language | File | Time |
---|---|---|
Python | jawiktionary-20200501-categorylinks.sql | 0m19.921s |
F# | jawiktionary-20200501-categorylinks.sql | 0m03.168s |
F# | jawiktionary-20200501-categorylinks.sql.gz | 0m03.509s |
Python | enwiktionary-20200501-categorylinks.sql | 15m58.965s |
F# | enwiktionary-20200501-categorylinks.sql | 1m52.396s |
F# | enwiktionary-20200501-categorylinks.sql.gz | 2m15.380s |
Processing the compressed file directly is a bit slower, but it saves disk space, which is an advantage.
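Python can also stream a compressed file directly with the standard gzip module; a minimal sketch (the file name sample.sql.gz is hypothetical) showing that gzip.open in text mode yields a file-like object that read_sql could consume without extracting first:

```python
import gzip

# Create a small compressed SQL file for demonstration.
with gzip.open("sample.sql.gz", "wt", encoding="utf-8") as f:
    f.write("INSERT INTO `t` VALUES (1,2,3);\n")

# Read it back while still compressed, just like GZipStream in F#.
# errors="replace" handles malformed UTF-8 as in the main script.
with gzip.open("sample.sql.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        print(line.rstrip())  # INSERT INTO `t` VALUES (1,2,3);
```
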