At PyCon JP 2016, I learned that recent Python versions ship with asyncio. As I understand it, asyncio lets a program use waiting time efficiently when it blocks on things like I/O, so I thought it could be used to speed up queries to MySQL. See other articles for details on asyncio itself. I investigated whether there is a difference in processing time when running queries against MySQL with an asyncio-aware library versus a conventional one. As the asyncio-aware MySQL client I use aiomysql. aiomysql appears to be based on PyMySQL, but since my original code used MySQL-Connector-Python, that is the comparison baseline.
In conclusion, there was no significant difference in processing time between aiomysql and MySQL-Connector-Python... Am I using asyncio in the wrong situation? Or is the query blocked on a lock inside MySQL?
This result doesn't show that aiomysql is slow, only that it doesn't get faster with these Spatial queries (and even this SELECT query does get faster with multi-process parallelization...).
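For reference, a minimal sketch of the multi-process approach mentioned above. The names here are hypothetical: `query_one` is only a placeholder for the actual St_Contains query, and a real version would have to open one MySQL connection per worker process.

```python
# Sketch of multi-process parallelization (hypothetical names).
from multiprocessing import Pool

def query_one(point):
    # Placeholder for: open (or reuse) a per-process connection and run
    # "SELECT code FROM gxmlcity WHERE St_Contains(shape, GeomFromText(%s))".
    lat, lon = point
    return 'POINT({} {})'.format(lat, lon)

def match_parallel(points, processes=4):
    # Each worker handles a chunk of the input rows, so several MySQL
    # sessions run queries at the same time instead of one after another.
    with Pool(processes=processes) as pool:
        return pool.map(query_one, points)

if __name__ == '__main__':
    print(match_parallel([(35.0, 135.0), (36.0, 139.0)]))
```

With one session per process, MySQL can serve the queries concurrently, which is why this pattern sped up even the plain SELECT in my experience.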
I am running SQL that uses MySQL's Spatial Index. The table is as follows; it contains the boundary polygon data [1] of Japanese cities, wards, towns, and villages.
create table if not exists {TABLE} (
    code mediumint(5) not null,
    name varchar(100) not null,
    shape multipolygon not null,
    center point,
    primary key (code),
    spatial key shape (shape)
) engine=MyISAM default charset=utf8;
The query I ran is:
select code from {TABLE}
where st_contains(shape, geomfromtext(%s))
The program reads a TSV file containing latitude and longitude and outputs the matching area for each line.
asyncmatcher.py
# coding: utf-8
import sys
import csv
csv.field_size_limit(10000000000)
import asyncio
import aiomysql

TABLE = 'gxmlcity'
contains_sql = ('SELECT code from {table} '
                'WHERE St_Contains(shape, GeomFromText(%s))').format(table=TABLE)

import time

def record_time(func):
    def record(*args, **kwargs):
        start = time.time()
        ret = func(*args, **kwargs)
        elapsed_time = time.time() - start
        print('Elapsed time: {} [sec]'.format(elapsed_time), file=sys.stderr)
        return ret
    return record

def print_result(cols, result):
    if result:
        print(*(tuple(cols) + result[0]), sep='\t')
    else:
        print(*(tuple(cols) + ('No Match',)), sep='\t')
    if len(result) > 1:
        print(cols, result)

async def match(cur, lat, lon):
    p_str = 'POINT({} {})'.format(lat, lon)
    await cur.execute(contains_sql, (p_str,))
    result = await cur.fetchall()
    return result

async def iterate_to_match(cur, args):
    for cols in csv.reader(args.infile, delimiter='\t'):
        if cols[2] != 'None':
            result = await match(cur, float(cols[2]), float(cols[3]))
            print_result(cols, result)

async def match_areas(loop, args):
    conn = await aiomysql.connect(user='root', password='', db=args.dbname,
                                  loop=loop, charset='utf8')
    try:
        cur = await conn.cursor()
        await iterate_to_match(cur, args)
        await cur.close()
    finally:
        conn.close()

def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description='Asynchronous area matching')
    parser.add_argument('--infile', type=argparse.FileType('r', encoding='utf-8'),
                        default=sys.stdin)
    parser.add_argument('--dbname', required=True, help='DB name with area DB')
    return parser.parse_args()

@record_time
def main():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(match_areas(loop, args))
    loop.close()

if __name__ == '__main__':
    args = parse_args()
    main()
singlematcher.py
# coding: utf-8
import sys
import csv
csv.field_size_limit(10000000000)
import mysql.connector

TABLE = 'gxmlcity'
contains_sql = ('SELECT code from {table} '
                'WHERE St_Contains(shape, GeomFromText(%s))').format(table=TABLE)

import time

def record_time(func):
    def record(*args, **kwargs):
        start = time.time()
        ret = func(*args, **kwargs)
        elapsed_time = time.time() - start
        print('Elapsed time: {} [sec]'.format(elapsed_time), file=sys.stderr)
        return ret
    return record

def print_result(cols, result):
    if result:
        print(*(tuple(cols) + result[0]), sep='\t')
    else:
        print(*(tuple(cols) + ('No Match',)), sep='\t')
    if len(result) > 1:
        print(cols, result)

def match(cur, lat, lon):
    p_str = 'POINT({} {})'.format(lat, lon)
    cur.execute(contains_sql, (p_str,))
    result = cur.fetchall()
    return result

def iterate_to_match(cur, args):
    for cols in csv.reader(args.infile, delimiter='\t'):
        if cols[2] != 'None':
            result = match(cur, float(cols[2]), float(cols[3]))
            print_result(cols, result)

def match_areas(args):
    conn = mysql.connector.connect(user='root', password='', db=args.dbname,
                                   charset='utf8')
    try:
        cur = conn.cursor()
        iterate_to_match(cur, args)
        cur.close()
    finally:
        conn.close()

def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description='Conventional area matching')
    parser.add_argument('--infile', type=argparse.FileType('r', encoding='utf-8'),
                        default=sys.stdin)
    parser.add_argument('--dbname', required=True, help='DB name with area DB')
    return parser.parse_args()

@record_time
def main():
    match_areas(args)

if __name__ == '__main__':
    args = parse_args()
    main()
Using aiomysql (the version that is supposed to handle requests asynchronously):
$ time ( gzip -dc json_2014-08-01.txt.gz | head -n 1000 | python scripts/asyncmatcher.py --dbname reftest > /dev/null )
Elapsed time: 29.44952368736267 [sec]
real 0m29.581s
user 0m0.736s
sys 0m0.044s
Using MySQL-Connector-Python (the conventional version):
$ time ( gzip -dc json_2014-08-01.txt.gz | head -n 1000 | python scripts/singlematcher.py --dbname reftest > /dev/null )
Elapsed time: 27.986697673797607 [sec]
real 0m28.183s
user 0m0.620s
sys 0m0.024s
Even with asyncio, it doesn't get faster... Is this even the right situation to use it?
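One likely reason: the loop in asyncmatcher.py awaits each query before issuing the next, so only one query is ever in flight and the event loop has nothing else to do while it waits. asyncio pays off when many operations overlap. The following is a minimal sketch of that difference, with `asyncio.sleep` standing in for query latency so no MySQL server is needed; a real overlapped version would also need multiple connections (e.g. from `aiomysql.create_pool`), since one cursor can serve only one query at a time.

```python
import asyncio
import time

async def fake_query(latency):
    # Stand-in for one St_Contains query; sleep simulates the server-side wait.
    await asyncio.sleep(latency)
    return latency

async def sequential(latencies):
    # Mirrors iterate_to_match: each query is awaited before the next starts,
    # so total time is roughly the sum of all waits.
    return [await fake_query(t) for t in latencies]

async def concurrent(latencies):
    # Issues all queries first, then waits for them together,
    # so total time is roughly the single slowest wait.
    return await asyncio.gather(*(fake_query(t) for t in latencies))

def timed(coro):
    start = time.time()
    result = asyncio.get_event_loop().run_until_complete(coro)
    return result, time.time() - start

if __name__ == '__main__':
    _, t_seq = timed(sequential([0.1] * 5))   # roughly 0.5 s
    _, t_conc = timed(concurrent([0.1] * 5))  # roughly 0.1 s
    print(t_seq, t_conc)
```

In other words, a single connection driven one query at a time behaves like the synchronous version, which matches the measurements below.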
Note that the Python versions differ between the two runs: MySQL-Connector-Python was not yet compatible with Python 3.5 at the time.
[1] Taihei Morikuni, Mitsuo Yoshida, Masayuki Okabe, Kyoji Umemura. Word filtering method for estimating the position of tweet posting. IPSJ Transactions on Databases. 2015, vol. 8, no. 4, p. 16–26.