At PyCon JP 2016, I learned that recent Python versions ship with asyncio. As I understand it, asyncio lets a program use waiting time efficiently when it blocks on things like I/O, so I thought it could be used to speed up queries to MySQL. See other articles for details on asyncio itself. I investigated whether there is a difference in processing time when running queries against MySQL with an asyncio-aware library versus a conventional one. As the asyncio-aware MySQL client I use aiomysql. aiomysql appears to be based on PyMySQL, but since my original code used MySQL-Connector-Python, that is the comparison baseline.
In conclusion, there was no significant difference in processing time between aiomysql and MySQL-Connector-Python... Am I using asyncio in the wrong situation? Or is the query blocked on a lock inside MySQL?
This result doesn't show that aiomysql is slow, only that it doesn't get faster with these Spatial queries (and even this SELECT query does get faster with multi-process parallelization...).
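For reference, a minimal sketch of the multi-process approach mentioned above. The names here are hypothetical: `query_one` is only a placeholder for the actual St_Contains query, and a real version would have to open one MySQL connection per worker process.

```python
# Sketch of multi-process parallelization (hypothetical names).
from multiprocessing import Pool

def query_one(point):
    # Placeholder for: open (or reuse) a per-process connection and run
    # "SELECT code FROM gxmlcity WHERE St_Contains(shape, GeomFromText(%s))".
    lat, lon = point
    return 'POINT({} {})'.format(lat, lon)

def match_parallel(points, processes=4):
    # Each worker handles a chunk of the input rows, so several MySQL
    # sessions run queries at the same time instead of one after another.
    with Pool(processes=processes) as pool:
        return pool.map(query_one, points)

if __name__ == '__main__':
    print(match_parallel([(35.0, 135.0), (36.0, 139.0)]))
```

With one session per process, MySQL can serve the queries concurrently, which is why this pattern sped up even the plain SELECT in my experience.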
I am running SQL that uses MySQL's Spatial Index. The table is as follows; it contains the boundary polygon data [1] of Japanese cities, wards, towns, and villages.
create table if not exists {TABLE} (
    code mediumint(5) not null,
    name varchar(100) not null,
    shape multipolygon not null,
    center point,
    primary key (code),
    spatial key shape (shape)
) engine=MyISAM default charset=utf8;
The query I ran is:
select code from {TABLE}
where st_contains(shape, geomfromtext(%s))
The program reads a TSV file containing latitude and longitude and outputs the matching area for each line.
asyncmatcher.py
# coding: utf-8
import sys
import csv
csv.field_size_limit(10000000000)
import asyncio
import aiomysql

TABLE = 'gxmlcity'
contains_sql = ('SELECT code from {table} '
                'WHERE St_Contains(shape, GeomFromText(%s))').format(table=TABLE)

import time

def record_time(func):
    def record(*args, **kwargs):
        start = time.time()
        ret = func(*args, **kwargs)
        elapsed_time = time.time() - start
        print('Elapsed time: {} [sec]'.format(elapsed_time), file=sys.stderr)
        return ret
    return record

def print_result(cols, result):
    if result:
        print(*(tuple(cols) + result[0]), sep='\t')
    else:
        print(*(tuple(cols) + ('No Match',)), sep='\t')
    if len(result) > 1:
        print(cols, result)

async def match(cur, lat, lon):
    p_str = 'POINT({} {})'.format(lat, lon)
    await cur.execute(contains_sql, (p_str,))
    result = await cur.fetchall()
    return result

async def iterate_to_match(cur, args):
    for cols in csv.reader(args.infile, delimiter='\t'):
        if cols[2] != 'None':
            result = await match(cur, float(cols[2]), float(cols[3]))
            print_result(cols, result)

async def match_areas(loop, args):
    conn = await aiomysql.connect(user='root', password='', db=args.dbname,
                                  loop=loop, charset='utf8')
    try:
        cur = await conn.cursor()
        await iterate_to_match(cur, args)
        await cur.close()
    finally:
        conn.close()

def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description='Asynchronous area matching')
    parser.add_argument('--infile', type=argparse.FileType('r', encoding='utf-8'),
                        default=sys.stdin)
    parser.add_argument('--dbname', required=True, help='DB name with area DB')
    return parser.parse_args()

@record_time
def main():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(match_areas(loop, args))
    loop.close()

if __name__ == '__main__':
    args = parse_args()
    main()
singlematcher.py
# coding: utf-8
import sys
import csv
csv.field_size_limit(10000000000)
import mysql.connector

TABLE = 'gxmlcity'
contains_sql = ('SELECT code from {table} '
                'WHERE St_Contains(shape, GeomFromText(%s))').format(table=TABLE)

import time

def record_time(func):
    def record(*args, **kwargs):
        start = time.time()
        ret = func(*args, **kwargs)
        elapsed_time = time.time() - start
        print('Elapsed time: {} [sec]'.format(elapsed_time), file=sys.stderr)
        return ret
    return record

def print_result(cols, result):
    if result:
        print(*(tuple(cols) + result[0]), sep='\t')
    else:
        print(*(tuple(cols) + ('No Match',)), sep='\t')
    if len(result) > 1:
        print(cols, result)

def match(cur, lat, lon):
    p_str = 'POINT({} {})'.format(lat, lon)
    cur.execute(contains_sql, (p_str,))
    result = cur.fetchall()
    return result

def iterate_to_match(cur, args):
    for cols in csv.reader(args.infile, delimiter='\t'):
        if cols[2] != 'None':
            result = match(cur, float(cols[2]), float(cols[3]))
            print_result(cols, result)

def match_areas(args):
    conn = mysql.connector.connect(user='root', password='', db=args.dbname,
                                   charset='utf8')
    try:
        cur = conn.cursor()
        iterate_to_match(cur, args)
        cur.close()
    finally:
        conn.close()

def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description='Conventional area matching')
    parser.add_argument('--infile', type=argparse.FileType('r', encoding='utf-8'),
                        default=sys.stdin)
    parser.add_argument('--dbname', required=True, help='DB name with area DB')
    return parser.parse_args()

@record_time
def main():
    match_areas(args)

if __name__ == '__main__':
    args = parse_args()
    main()
Using aiomysql (the version that is supposed to handle requests asynchronously):
$ time ( gzip -dc json_2014-08-01.txt.gz | head -n 1000 | python scripts/asyncmatcher.py --dbname reftest > /dev/null )
Elapsed time: 29.44952368736267 [sec]
real 0m29.581s
user 0m0.736s
sys 0m0.044s
Using MySQL-Connector-Python (the conventional version):
$ time ( gzip -dc json_2014-08-01.txt.gz | head -n 1000 | python scripts/singlematcher.py --dbname reftest > /dev/null )
Elapsed time: 27.986697673797607 [sec]
real 0m28.183s
user 0m0.620s
sys 0m0.024s
Even with asyncio, it doesn't get faster... Is this even the right situation to use it?
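One likely reason: the loop in asyncmatcher.py awaits each query before issuing the next, so only one query is ever in flight and the event loop has nothing else to do while it waits. asyncio pays off when many operations overlap. The following is a minimal sketch of that difference, with `asyncio.sleep` standing in for query latency so no MySQL server is needed; a real overlapped version would also need multiple connections (e.g. from `aiomysql.create_pool`), since one cursor can serve only one query at a time.

```python
import asyncio
import time

async def fake_query(latency):
    # Stand-in for one St_Contains query; sleep simulates the server-side wait.
    await asyncio.sleep(latency)
    return latency

async def sequential(latencies):
    # Mirrors iterate_to_match: each query is awaited before the next starts,
    # so total time is roughly the sum of all waits.
    return [await fake_query(t) for t in latencies]

async def concurrent(latencies):
    # Issues all queries first, then waits for them together,
    # so total time is roughly the single slowest wait.
    return await asyncio.gather(*(fake_query(t) for t in latencies))

def timed(coro):
    start = time.time()
    result = asyncio.get_event_loop().run_until_complete(coro)
    return result, time.time() - start

if __name__ == '__main__':
    _, t_seq = timed(sequential([0.1] * 5))   # roughly 0.5 s
    _, t_conc = timed(concurrent([0.1] * 5))  # roughly 0.1 s
    print(t_seq, t_conc)
```

In other words, a single connection driven one query at a time behaves like the synchronous version, which matches the measurements below.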
Note that the Python versions differ between the two runs: MySQL-Connector-Python was not yet compatible with Python 3.5 at the time.
[1] Taihei Morikuni, Mitsuo Yoshida, Masayuki Okabe, Kyoji Umemura. Word filtering method for estimating the position of tweet posting. IPSJ Transactions on Databases. 2015, vol. 8, no. 4, p. 16–26.