**The Python script that used to take a little over 20 minutes to fetch and build one year of game data can now build 15 years of data in about 10 minutes.**
The results are in this repository:

[Shinichi-Nakagawa/py-retrosheet](https://github.com/Shinichi-Nakagawa/py-retrosheet)
That's roughly a 20x performance improvement over the original!

That said, I didn't use any especially fancy techniques, so opinions and suggestions are welcome.
This post is a sequel to my baseball talk at PyCon JP 2015.

It's also my first post on Qiita. Nice to meet you.
At PyCon JP 2015 I gave the talk "[Baseball Hack! ~ Data analysis and visualization using Python](http://www.slideshare.net/shinyorke/hackpython-pyconjp)", where I told the story of downloading MLB data, migrating it to a database, and analyzing and visualizing it with IPython Notebook and pandas.
Among the things I introduced was "[py-retrosheet](https://github.com/wellsoliver/py-retrosheet)", a library that **downloads the MLB game and player performance data "[RETROSHEET](http://retrosheet.org/)" and migrates it into your favorite database!**
I introduced it with a pretty smug look on my face, but once I actually used it, I found a number of unpleasant things. They aren't bugs, exactly, but they're problems all the same.

Without py-retrosheet I could never have made it to PyCon, nor played with Joey Votto's or Adam Dunn's data, so I am very grateful to the library's authors and contributors. Even so, when I tried using it, I hit the following non-negligible problems.
Retrosheet data is only about 190,000 to 200,000 rows per season, yet somewhere the library takes a huge detour:

**It took 20 minutes from downloading the data to having the database built.**

That works out to a full minute per 10,000 rows, which is absurdly slow.
Skimming the code made the causes obvious: it did needless scraping, issued mysterious SQL for every single record (such as a SELECT with a unique-key check even when building the database from a fresh state), and inserted rows one at a time.

**And on top of that, much of the code was near copy-and-paste!**
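To make the problem concrete, here is a paraphrase of the pattern the hot loop boiled down to (a sketch, not the actual original code):

```python
# Paraphrased sketch of the original anti-pattern: one SELECT plus one INSERT
# round trip per CSV row -- roughly 190,000 round trips for a single season.
import csv


def insert_rows_one_by_one(connection, csv_file):
    """connection is e.g. a SQLAlchemy connection; csv_file is a Retrosheet event CSV."""
    with open(csv_file) as f:
        for row in csv.reader(f):
            # unique-key existence check, even against a freshly created database
            result = connection.execute(
                "SELECT * FROM events WHERE game_id = '{0}'".format(row[0]))
            if result.rowcount > 0:
                continue
            # one INSERT statement per row
            connection.execute(
                'INSERT INTO events VALUES ("{0}")'.format('", "'.join(row)))
```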
Also, since I personally want to do my main development in Python 3, a library that only supports Python 2 (so-called "Legacy Python") just didn't feel cool, so I decided to improve it.

So I immediately forked the original repository as [Shinichi-Nakagawa/py-retrosheet](https://github.com/Shinichi-Nakagawa/py-retrosheet) and set about improving it myself. The improvements went in the following order.
The slow processing was the most serious problem, and by rights I should have started there. But in order to understand the code accurately and pin down the bottleneck, I first ported the original code to Python 3. Besides forcing me to grasp the specification, this gave me a clear test condition: if it behaves the same as the original, it passes.

After that, the plan was to refactor the original code, improve its maintainability, and then attack the speed bottleneck.

**The code turned out to be such a mess that I ended up deciding to remake it.**
I converted the code with 2to3, the standard tool when migrating from Python 2 to 3, and then worked through the remaining exceptions one by one. While I was at it, I also fixed PEP 8 violations wherever fixing them seemed worthwhile (and feasible).
The painful part was that the import statements inevitably ballooned into things like this:
scripts/download.py
```python
import os
import requests

try:
    # Python 3.x
    from configparser import ConfigParser
    import queue
except ImportError:
    # Python 2.x (ugh, I didn't want to write this... orz)
    from ConfigParser import ConfigParser
    import Queue as queue
```
I managed to get by by changing the unit of import and aliasing the Python 2 modules to their Python 3 names.
Once this conversion was done, I sent a pull request to the original author.
After the Python 3 support landed, I intended to write tests and then refactor, but the code was so hard to test, and frankly so poor to begin with, that I got fed up and decided to rewrite it.
What py-retrosheet does is really simple, yet the code felt like it was going out of its way to complicate that simple task. So, while reviewing the structure itself, I decided to also make use of features added in Python 3.5 (because I personally wanted to try them) and made several improvements.
In addition, the original script could only fetch and build data for a single season (one year) per run, but since in practice I wanted two or more years of data, I added support for that too.

First, the download. This is actually almost the same as the original code.
scripts/retrosheet_download.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Download.
Python 3.5.0+ (don't know about 3.4- and 2.x, sorry)
MySQL 5.6.0+ (don't know about 5.5- , sorry)
"""
import os
import click
from configparser import ConfigParser
from classes.fetcher import Fetcher
from queue import Queue

__author__ = 'Shinichi Nakagawa'


class RetrosheetDownload(object):

    FILES = (
        {
            'name': 'eventfiles',
            'config_flg': 'dl_eventfiles',
            'url': 'eventfiles_url',
            'pattern': r'({year})eve\.zip',
            'download_url': 'http://www.retrosheet.org/events/{year}eve.zip',
        },
        {
            'name': 'gamelogs',
            'config_flg': 'dl_gamelogs',
            'url': 'gamelogs_url',
            'pattern': r'gl({year})\.zip',
            'download_url': 'http://www.retrosheet.org/gamelogs/gl{year}.zip',
        }
    )

    def __init__(self, configfile: str):
        """
        initialize
        :param configfile: configuration file
        """
        # configuration
        self.config = ConfigParser()
        self.config.read(configfile)
        self.num_threads = self.config.getint('download', 'num_threads')
        self.path = self.config.get('download', 'directory')
        self.absolute_path = os.path.abspath(self.path)

    def download(self, queue):
        """
        Download & extract archives
        :param queue: Queue of download urls
        """
        threads = []
        for i in range(self.num_threads):
            t = Fetcher(queue, self.absolute_path, {'verbose': self.config.get('debug', 'verbose')})
            t.start()
            threads.append(t)
        for thread in threads:
            thread.join()

    @classmethod
    def run(cls, from_year: int, to_year: int, configfile: str):
        """
        :param from_year: Season(from)
        :param to_year: Season(to)
        :param configfile: Config file
        """
        client = RetrosheetDownload(configfile)

        if not os.path.exists(client.absolute_path):
            print("Directory %s does not exist, creating..." % client.absolute_path)
            os.makedirs(client.absolute_path)

        urls = Queue()
        for year in range(from_year, to_year + 1):
            for _file in RetrosheetDownload.FILES:
                urls.put(_file['download_url'].format(year=year))

        client.download(urls)


@click.command()
@click.option('--from_year', '-f', default=2001, help='From Season')
@click.option('--to_year', '-t', default=2014, help='To Season')
@click.option('--configfile', '-c', default='config.ini', help='Config File')
def download(from_year, to_year, configfile):
    """
    :param from_year: Season(from)
    :param to_year: Season(to)
    :param configfile: Config file
    """
    # from <= to check
    if from_year > to_year:
        print('not From <= To({from_year} <= {to_year})'.format(from_year=from_year, to_year=to_year))
        raise SystemExit
    RetrosheetDownload.run(from_year, to_year, configfile)


if __name__ == '__main__':
    download()
```
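The `Fetcher` imported from `classes.fetcher` isn't reproduced here; conceptually it's a worker thread that keeps pulling URLs off the shared queue, downloading each archive and extracting it. A minimal sketch of that idea (not the actual repo code):

```python
# Minimal sketch of a Fetcher-style worker thread (the real classes/fetcher.py differs).
import os
import threading
import zipfile
from queue import Empty, Queue
from urllib.request import urlretrieve


class Fetcher(threading.Thread):

    def __init__(self, queue: Queue, path: str, options: dict):
        super().__init__()
        self.queue = queue      # shared queue of download urls
        self.path = path        # download directory
        self.options = options  # e.g. {'verbose': 'True'}

    def run(self):
        while True:
            try:
                url = self.queue.get_nowait()  # stop once the queue is drained
            except Empty:
                break
            filename = os.path.join(self.path, url.split('/')[-1])
            if self.options.get('verbose'):
                print('Fetching {url}'.format(url=url))
            urlretrieve(url, filename)
            # Retrosheet archives arrive zipped; extract them in place
            if zipfile.is_zipfile(filename):
                with zipfile.ZipFile(filename) as z:
                    z.extractall(self.path)
            self.queue.task_done()
```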
I stopped fetching and scraping pages that weren't needed.

I also addressed a maintenance problem that ran through py-retrosheet in general: convoluted handling of command-line arguments and config values. The former was solved with click, the latter by consolidating all config reads into the class's initialization.

The same goes for the CSV generation and database-load scripts introduced below.
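For reference, here is a minimal `config.ini` sketch covering the keys these scripts read. The section and option names come straight from the code; the values are just examples, and the actual repo config may hold more options:

```ini
[download]
num_threads = 4
directory = files

[debug]
verbose = True

[chadwick]
directory = /usr/local/bin/

[database]
host = localhost
database = retrosheet
user = root
password = secret

[retrosheet_mysql]
record_check = False
multiple_insert_rows = 1000
```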
Next, the part that extracts the data into CSV using Chadwick.
scripts/parse_csv.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Parse Event Files, Game Logs and Rosters to CSV.
Python 3.5.0+ (don't know about 3.4- and 2.x, sorry)
"""
import os
import subprocess
import click
from configparser import ConfigParser

__author__ = 'Shinichi Nakagawa'


class ParseCsv(object):

    CW_EVENT = '{chadwick_path}cwevent'
    CW_GAME = '{chadwick_path}cwgame'
    CW_EVENT_CMD = '{chadwick_path}cwevent -q -n -f 0-96 -x 0-62 -y {year} {year}*.EV* > {csvpath}/events-{year}.csv'
    CW_GAME_CMD = '{chadwick_path}cwgame -q -n -f 0-83 -y {year} {year}*.EV* > {csvpath}/games-{year}.csv'
    EV_FILE_PATTERN = '{path}/{year}*.EV*'
    EVENT_FILE = '{csvpath}/events-{year}.csv'
    GAME_FILE = '{csvpath}/games-{year}.csv'
    CSV_PATH = 'csv'

    def __init__(self):
        pass

    @classmethod
    def exists_chadwick(cls, chadwick_path: str):
        """
        exists chadwick binary
        :param chadwick_path: chadwick path
        :return: True or False
        """
        if os.path.exists(chadwick_path) \
                and os.path.exists(cls.CW_EVENT.format(chadwick_path=chadwick_path)) \
                and os.path.exists(cls.CW_GAME.format(chadwick_path=chadwick_path)):
            return True
        return False

    @classmethod
    def generate_files(
            cls,
            year: int,
            cmd_format: str,
            filename_format: str,
            chadwick_path: str,
            verbose: bool,
            csvpath: str
    ):
        """
        Generate CSV file
        :param year: Season
        :param cmd_format: Command format
        :param filename_format: Filename format
        :param chadwick_path: Chadwick Command Path
        :param verbose: Debug flg
        :param csvpath: csv output path
        """
        cmd = cmd_format.format(csvpath=csvpath, year=year, chadwick_path=chadwick_path)
        filename = filename_format.format(csvpath=csvpath, year=year)
        if os.path.isfile(filename):
            os.remove(filename)
        if verbose:
            print('calling {cmd}'.format(cmd=cmd))
        subprocess.call(cmd, shell=True)

    @classmethod
    def generate_retrosheet_files(
            cls,
            from_year: int,
            to_year: int,
            chadwick_path: str,
            verbose: bool,
            csvpath: str
    ):
        """
        Generate CSV files
        :param from_year: Season(from)
        :param to_year: Season(to)
        :param chadwick_path: Chadwick Command Path
        :param verbose: Debug flg
        :param csvpath: csv output path
        """
        # generate files
        for year in range(from_year, to_year + 1):
            # game
            ParseCsv.generate_files(
                year=year,
                cmd_format=ParseCsv.CW_GAME_CMD,
                filename_format=ParseCsv.GAME_FILE,
                chadwick_path=chadwick_path,
                verbose=verbose,
                csvpath=csvpath
            )
            # event
            ParseCsv.generate_files(
                year=year,
                cmd_format=ParseCsv.CW_EVENT_CMD,
                filename_format=ParseCsv.EVENT_FILE,
                chadwick_path=chadwick_path,
                verbose=verbose,
                csvpath=csvpath
            )

    @classmethod
    def run(cls, from_year: int, to_year: int, configfile: str):
        """
        :param from_year: Season(from)
        :param to_year: Season(to)
        :param configfile: Config file
        """
        config = ConfigParser()
        config.read(configfile)
        verbose = config.getboolean('debug', 'verbose')
        chadwick = config.get('chadwick', 'directory')
        path = os.path.abspath(config.get('download', 'directory'))
        csv_path = '{path}/csv'.format(path=path)

        # command exists check
        if not cls.exists_chadwick(chadwick):
            print('chadwick does not exist in {chadwick} - exiting'.format(chadwick=chadwick))
            raise SystemExit

        # make directory
        os.chdir(path)
        if not os.path.exists(ParseCsv.CSV_PATH):
            os.makedirs(ParseCsv.CSV_PATH)

        # generate files
        cls.generate_retrosheet_files(
            from_year=from_year,
            to_year=to_year,
            chadwick_path=chadwick,
            verbose=verbose,
            csvpath=csv_path
        )

        # change directory
        os.chdir(os.path.dirname(os.path.abspath(__file__)))


@click.command()
@click.option('--from_year', '-f', default=2001, help='From Season')
@click.option('--to_year', '-t', default=2014, help='To Season')
@click.option('--configfile', '-c', default='config.ini', help='Config File')
def create_retrosheet_csv(from_year, to_year, configfile):
    """
    :param from_year: Season(from)
    :param to_year: Season(to)
    :param configfile: Config file
    """
    # from <= to check
    if from_year > to_year:
        print('not From <= To({from_year} <= {to_year})'.format(from_year=from_year, to_year=to_year))
        raise SystemExit
    ParseCsv.run(from_year, to_year, configfile)


if __name__ == '__main__':
    create_retrosheet_csv()
```
Originally, the game data (game logs) and the in-game event data (at bats, stolen bases, and so on) were processed by separate code paths even though they did exactly the same thing; only the command invoked and the output destination differed. I folded the shared processing into a single method and moved the differences out into external definitions. The database-load implementation below follows the same pattern.
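For example, the event-file command template expands for a single season like this (the paths are illustrative):

```python
# How one command definition expands for a given season (illustrative paths).
CW_EVENT_CMD = '{chadwick_path}cwevent -q -n -f 0-96 -x 0-62 -y {year} {year}*.EV* > {csvpath}/events-{year}.csv'

cmd = CW_EVENT_CMD.format(chadwick_path='/usr/local/bin/', year=2014, csvpath='files/csv')
print(cmd)
# /usr/local/bin/cwevent -q -n -f 0-96 -x 0-62 -y 2014 2014*.EV* > files/csv/events-2014.csv
```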
scripts/retrosheet_mysql.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Migrate Database, Game Logs and Roster.
Python 3.5.0+ (don't know about 3.4- and 2.x, sorry)
MySQL 5.6.0+ (don't know about 5.5- , sorry)
"""
import os
import csv
import click
import sqlalchemy
from glob import glob
from configparser import ConfigParser, NoOptionError

__author__ = 'Shinichi Nakagawa'


class RetrosheetMySql(object):

    DATABASE_ENGINE = 'mysql+pymysql'
    ENGINE = '{engine}://{user}:{password}@{host}/{database}'
    TABLES = (
        {
            'name': 'teams',
            'header': False,
            'year': False,
            'mask': '{path}/TEAM{year}*',
            'select': "SELECT * FROM teams WHERE team_id = '{key_0}'",
            'where_index': [0],
            'insert': 'INSERT INTO teams VALUES {values}',
        },
        {
            'name': 'rosters',
            'header': False,
            'year': True,
            'mask': '{path}/*{year}*.ROS',
            'select': "SELECT * FROM rosters WHERE year = {year} AND player_id = '{key_0}' AND team_tx = '{key_1}'",
            'where_index': [0, 5],
            'insert': 'INSERT INTO rosters VALUES {values}',
        },
        {
            'name': 'games',
            'header': True,
            'year': False,
            'mask': '{path}/csv/games-{year}*.csv',
            'select': "SELECT * FROM games WHERE game_id = '{key_0}'",
            'where_index': [0],
            'insert': 'INSERT INTO games({columns}) VALUES {values}',
        },
        {
            'name': 'events',
            'header': True,
            'year': False,
            'mask': '{path}/csv/events-{year}*.csv',
            'select': "SELECT * FROM events WHERE game_id = '{key_0}' AND event_id = '{key_1}'",
            'where_index': [0, 96],
            'insert': 'INSERT INTO events({columns}) VALUES {values}',
        },
    )

    def __init__(self, configfile: str):
        """
        initialize
        :param configfile: configuration file
        """
        # configuration
        config = ConfigParser()
        config.read(configfile)
        self.path = os.path.abspath(config.get('download', 'directory'))
        self.record_check = config.getboolean('retrosheet_mysql', 'record_check')
        self.multiple_insert_rows = config.getint('retrosheet_mysql', 'multiple_insert_rows')
        # connection
        self.connection = self._generate_connection(config)

    @classmethod
    def _generate_connection(cls, config: ConfigParser):
        """
        generate database connection
        :param config: ConfigParser object
        :return:
        """
        try:
            database_engine = cls.DATABASE_ENGINE
            database = config.get('database', 'database')
            host = config.get('database', 'host')
            user = config.get('database', 'user')
            password = config.get('database', 'password')
        except NoOptionError:
            print('Need to define engine, user, password, host, and database parameters')
            raise SystemExit

        db = sqlalchemy.create_engine(
            cls.ENGINE.format(
                engine=database_engine,
                user=user,
                password=password,
                host=host,
                database=database,
            )
        )
        return db.connect()

    @classmethod
    def _exists_record(cls, year: int, table: dict, csv_row: list, connection):
        where = {'key_{i}'.format(i=i): csv_row[v] for i, v in enumerate(table['where_index'])}
        where['year'] = year
        sql = table['select'].format(**where)
        res = connection.execute(sql)
        if res.rowcount > 0:
            return True
        return False

    def _multiple_insert(self, query: str, columns: list, values: list):
        params = {
            'columns': columns,
            'values': ', '.join(values),
        }
        sql = query.format(**params)
        try:
            self.connection.execute(sql)
        except Exception as e:
            print(e)
            raise SystemExit

    def _create_record(self, year: int, csv_file: str, table: dict):
        reader = csv.reader(open(csv_file))
        headers = []
        values = []
        if table['header']:
            headers = next(reader)
        columns = ', '.join(headers)
        for row in reader:
            # Record Exists Check
            if self.record_check and RetrosheetMySql._exists_record(year, table, row, self.connection):
                continue
            # append record values
            if table['year']:
                row.insert(0, str(year))
            values.append('("{rows}")'.format(rows='", "'.join(row)))
            if len(values) == self.multiple_insert_rows:
                # inserts
                self._multiple_insert(table['insert'], columns, values)
                values = []
        if len(values) > 0:
            # inserts
            self._multiple_insert(table['insert'], columns, values)

    def execute(self, year: int):
        for table in self.TABLES:
            for csv_file in glob(table['mask'].format(path=self.path, year=year)):
                self._create_record(year, csv_file, table)

    @classmethod
    def run(cls, from_year: int, to_year: int, configfile: str):
        """
        :param from_year: Season(from)
        :param to_year: Season(to)
        :param configfile: Config file
        """
        client = RetrosheetMySql(configfile)
        for year in range(from_year, to_year + 1):
            client.execute(year)
        client.connection.close()


@click.command()
@click.option('--from_year', '-f', default=2001, help='From Season')
@click.option('--to_year', '-t', default=2014, help='To Season')
@click.option('--configfile', '-c', default='config.ini', help='Config File')
def migration(from_year, to_year, configfile):
    """
    :param from_year: Season(from)
    :param to_year: Season(to)
    :param configfile: Config file
    """
    # from <= to check
    if from_year > to_year:
        print('not From <= To({from_year} <= {to_year})'.format(from_year=from_year, to_year=to_year))
        raise SystemExit
    RetrosheetMySql.run(from_year, to_year, configfile)


if __name__ == '__main__':
    migration()
```
The biggest improvements in this implementation:

- The SELECT-based record check is still implemented, but it is no longer used by default (you can switch it back on in the config).
- Simply changing the INSERTs, which had politely gone row by row, to multi-row INSERTs of 1,000 rows at a time (also adjustable in the config) sped things up dramatically (see the toy example below).
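For illustration, this is the shape of the SQL that `_create_record` and `_multiple_insert` build; the team rows here are made-up toy data, not actual Retrosheet records:

```python
# Toy illustration of the multi-row INSERT batching (made-up rows, not repo output).
rows = [
    ['ANA', 'AL', 'Anaheim', 'Angels'],
    ['ARI', 'NL', 'Arizona', 'Diamondbacks'],
    ['ATL', 'NL', 'Atlanta', 'Braves'],
]
values = ['("{rows}")'.format(rows='", "'.join(row)) for row in rows]
sql = 'INSERT INTO teams VALUES {values}'.format(values=', '.join(values))
print(sql)
# INSERT INTO teams VALUES ("ANA", "AL", "Anaheim", "Angels"), ("ARI", "NL", "Arizona", "Diamondbacks"), ("ATL", "NL", "Atlanta", "Braves")
```

One round trip to MySQL now carries up to `multiple_insert_rows` rows instead of one.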
The code as a whole is also much easier to survey now, so adding the as-yet-unused All-Star and playoff data should be straightforward!
Having improved things this far, I ran a benchmark against the 2014 data (about 190,000 rows). The targets were the scripts before and after the rewrite (download, CSV generation, and database load), and I measured each run with the `time` command.
```
# Before improvement
$ time python download.py -y 2014
Queuing up Event Files for download (2014 only).
Queuing up Game Logs for download (2014 only).
Fetching 2014eve.zip
Fetching gl2014.zip
Zip file detected. Extracting gl2014.zip
Zip file detected. Extracting 2014eve.zip

real    0m5.816s
user    0m0.276s
sys     0m0.066s

$ time python parse.py -y 2014
calling '/usr/local/bin//cwevent -q -n -f 0-96 -x 0-62 -y 2014 2014*.EV* > /Users/shinyorke_mbp/PycharmProjects/py-retrosheet/scripts/files/csv/events-2014.csv'
calling '/usr/local/bin//cwgame -q -n -f 0-83 -y 2014 2014*.EV* > /Users/shinyorke_mbp/PycharmProjects/py-retrosheet/scripts/files/csv/games-2014.csv'
processing TEAM2014
processing ANA2014.ROS
processing ARI2014.ROS
processing ATL2014.ROS
processing BAL2014.ROS
processing BOS2014.ROS
processing CHA2014.ROS
processing CHN2014.ROS
processing CIN2014.ROS
processing CLE2014.ROS
processing COL2014.ROS
processing DET2014.ROS
processing HOU2014.ROS
processing KCA2014.ROS
processing LAN2014.ROS
processing MIA2014.ROS
processing MIL2014.ROS
processing MIN2014.ROS
processing NYA2014.ROS
processing NYN2014.ROS
processing OAK2014.ROS
processing PHI2014.ROS
processing PIT2014.ROS
processing SDN2014.ROS
processing SEA2014.ROS
processing SFN2014.ROS
processing SLN2014.ROS
processing TBA2014.ROS
processing TEX2014.ROS
processing TOR2014.ROS
processing WAS2014.ROS
processing /Users/shinyorke_mbp/PycharmProjects/py-retrosheet/scripts/files/csv/games-2014.csv
processing /Users/shinyorke_mbp/PycharmProjects/py-retrosheet/scripts/files/csv/events-2014.csv

real    19m53.224s
user    13m16.074s
sys     0m26.294s
```
```
# After improvement
$ time python retrosheet_download.py -f 2014 -t 2014
Fetching 2014eve.zip
Fetching gl2014.zip
Zip file detected. Extracting gl2014.zip
Zip file detected. Extracting 2014eve.zip

real    0m4.630s
user    0m0.217s
sys     0m0.055s

$ time python parse_csv.py -f 2014 -t 2014
calling /usr/local/bin/cwgame -q -n -f 0-83 -y 2014 2014*.EV* > /Users/shinyorke_mbp/PycharmProjects/py-retrosheet/scripts/files/csv/games-2014.csv
calling /usr/local/bin/cwevent -q -n -f 0-96 -x 0-62 -y 2014 2014*.EV* > /Users/shinyorke_mbp/PycharmProjects/py-retrosheet/scripts/files/csv/events-2014.csv

real    0m8.713s
user    0m8.321s
sys     0m0.221s

$ time python retrosheet_mysql.py -f 2014 -t 2014

real    0m21.435s
user    0m3.580s
sys     0m0.416s
```
**A process that originally took 20 minutes now finishes in under a minute!**

With this, even when you want a large amount of game or plate-appearance data, you can fetch and build it easily.
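For example, once the data is in MySQL, pulling it into pandas for analysis is one query away. A sketch, assuming the connection settings from the sample `config.ini` above, that `bat_id` holds the cwevent `BAT_ID` batter id, and that `'dunna001'` is Adam Dunn's Retrosheet id:

```python
# Sketch: load one batter's 2014 plate-appearance events into pandas.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://root:secret@localhost/retrosheet')
# A Retrosheet game_id looks like 'ANA201404010', so characters 4-7 are the year.
query = ("SELECT * FROM events "
         "WHERE bat_id = 'dunna001' AND SUBSTR(game_id, 4, 4) = '2014'")
df = pd.read_sql(query, engine)
print(df.shape)
```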
It was getting fun, so I decided to build data for the 15 seasons from 2000 through 2014.

That comes to roughly 2.9 million records in total, just shy of 3 million.

I wrote a new script that runs the whole pipeline in one shot.
scripts/migration.py
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Migrate Retrosheet Database.
Python 3.5.0+ (don't know about 3.4- and 2.x, sorry)
MySQL 5.6.0+ (don't know about 5.5- , sorry)
"""
import logging
import click
from retrosheet_download import RetrosheetDownload
from retrosheet_mysql import RetrosheetMySql
from parse_csv import ParseCsv

__author__ = 'Shinichi Nakagawa'


@click.command()
@click.option('--from_year', '-f', default=2010, help='From Season')
@click.option('--to_year', '-t', default=2014, help='To Season')
@click.option('--configfile', '-c', default='config.ini', help='Config File')
def main(from_year, to_year, configfile):
    """
    :param from_year: Season(from)
    :param to_year: Season(to)
    :param configfile: Config file
    """
    # from <= to check
    if from_year > to_year:
        print('not From <= To({from_year} <= {to_year})'.format(from_year=from_year, to_year=to_year))
        raise SystemExit

    # logging setting
    logging.basicConfig(
        level=logging.INFO,
        format="Time:%(asctime)s.%(msecs)03d\t message:%(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"
    )

    # Download
    logging.info('File Download Start({from_year}-{to_year})'.format(from_year=from_year, to_year=to_year))
    RetrosheetDownload.run(from_year, to_year, configfile)
    logging.info('File Download End')

    # Parse Csv
    logging.info('Csv Parse Start({from_year}-{to_year})'.format(from_year=from_year, to_year=to_year))
    ParseCsv.run(from_year, to_year, configfile)
    logging.info('Csv Parse End')

    # Migrate MySQL Database
    logging.info('Migrate Database Start({from_year}-{to_year})'.format(from_year=from_year, to_year=to_year))
    RetrosheetMySql.run(from_year, to_year, configfile)
    logging.info('Migrate Database End')


if __name__ == '__main__':
    main()
```
Here are the results.
```
$ python migration.py -f 2000 -t 2014
Time:2015-11-15 15:23:48.291 message:File Download Start(2000-2014)
Directory /Users/shinyorke_mbp/PycharmProjects/hatteberg/py-retrosheet/scripts/files does not exist, creating...
Time:2015-11-15 15:23:59.673 message:File Download End
Time:2015-11-15 15:23:59.673 message:Csv Parse Start(2000-2014)
Time:2015-11-15 15:26:37.219 message:Csv Parse End
Time:2015-11-15 15:26:37.220 message:Migrate Database Start(2000-2014)
Time:2015-11-15 15:32:50.070 message:Migrate Database End
```
**From downloading the data to finishing the migration took less than 10 minutes!!!**

This should make baseball hacking that much more fun!