I ported the MLB baseball data script [py-retrosheet](https://github.com/Shinichi-Nakagawa/py-retrosheet "py-retrosheet"), which I made compatible with Python 3.5 a little over a month ago, to golang by copying it line by line.
It is a script that downloads game data and results (mainly at-bat data) from Retrosheet (http://www.retrosheet.org/), parses them to a level that can be used for aggregation and analysis, and outputs CSV.
The original can also load the data into a database (MySQL, etc.), but I will build that part later (reason: I did not want to cut into my sleep ... lol).
Not that it matters, but the name "Go! Retrosheet" has a lot of momentum and sounds cool (said in a monotone).
There are two reasons why I did this.
At the last #gocon, motivated by "I've never done Golang even though I can go, so let me make something with Golang and do a walk-in LT!", [I did A Tour of Go in half a day, rewrote a Django application with Martini, and gave a wonderful walk-in LT](http://shinyorke.hatenablog.com/entry/2015/06/21/195656 "Golang I wrote a baseball app half a day later and started LT ~ Go Conference 2015 summer Report #gocon"). But come to think of it, I haven't written any Golang since then (PyCon happened in between ...), so I started this with the feeling of "let me write something to commemorate attending the winter #gocon!" I was also thinking of giving an LT if I finished in time.
Also, as a more serious reason, I don't know Go well enough (or rather, I am a complete beginner) to say things like
"Go for parallel processing~"
"Go is strong at ◯◯, so baseball should be Go too~"
so I decided to reconfirm Go's strengths and characteristics by copying existing code! In other words, this was also meant as "learning".
The code passes gofmt and golint, but there are few comments overall and parts of it may be sloppy ... Experts, I am waiting for your opinions m(_ _)m
Also, I haven't written any tests, so it is full-blown legacy code.
download.go
This is the code that downloads the archives (zip files) containing Retrosheet's CSV and decompresses them under the files directory.
The code that unzips the archives was written with reference to here, and I am grateful that it works properly! Still, I wondered whether it could be written more smartly.
Also, this was my first time writing parallel processing in Go, and I was surprised at how cleanly it could be written.
download.go
// Copyright The Shinichi Nakagawa. All rights reserved.
// license that can be found in the LICENSE file.
package main

import (
	"archive/zip"
	"flag"
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"os"
	"path"
	"path/filepath"
	"sync"
)

const (
	// DirName is the directory where downloaded files are saved.
	DirName = "files"
)

// IsExist reports whether the file exists.
func IsExist(filename string) bool {
	// Exists check
	_, err := os.Stat(filename)
	return err == nil
}

// MakeWorkDirectory creates the work directory if it does not exist yet.
func MakeWorkDirectory(dirname string) {
	if IsExist(dirname) {
		return
	}
	if err := os.MkdirAll(dirname, 0777); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}

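// DownloadArchives downloads a Retrosheet archive into dirname
// and returns the path of the saved file.
func DownloadArchives(url string, dirname string) string {
	// get archives
	fmt.Printf("download: %s\n", url)
	response, err := http.Get(url)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	// Close the body so the connection is released.
	defer response.Body.Close()
	fmt.Printf("status: %s\n", response.Status)
	// An error page is not a zip archive, so stop here on a non-200 status.
	if response.StatusCode != http.StatusOK {
		os.Exit(1)
	}

	// download
	body, err := ioutil.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	_, filename := path.Split(url)
	fmt.Println(filename)
	fullfilename := filepath.Join(dirname, filename)
	// O_TRUNC prevents stale bytes when an older, longer file is overwritten.
	file, err := os.OpenFile(fullfilename, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0777)
	if err != nil {
		// Exit here; writing to a file that failed to open would panic.
		fmt.Println(err)
		os.Exit(1)
	}
	defer file.Close()
	if _, err := file.Write(body); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	return fullfilename
}
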
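// Unzip extracts the zip archive fullfilename into outputdirectory.
func Unzip(fullfilename string, outputdirectory string) error {
	r, err := zip.OpenReader(fullfilename)
	if err != nil {
		return err
	}
	defer r.Close()

	for _, f := range r.File {
		if err := extractFile(f, outputdirectory); err != nil {
			return err
		}
	}
	return nil
}

// extractFile writes one archive entry under outputdirectory.
// It is a separate function so that the deferred Close calls run after
// each entry instead of piling up until Unzip returns.
func extractFile(f *zip.File, outputdirectory string) error {
	dest := filepath.Join(outputdirectory, f.Name)
	if f.FileInfo().IsDir() {
		return os.MkdirAll(dest, f.Mode())
	}
	rc, err := f.Open()
	if err != nil {
		return err
	}
	defer rc.Close()

	out, err := os.OpenFile(dest, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, f.Mode())
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, rc)
	return err
}
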
// GetEventsFileURL returns the events file URL for the given year.
func GetEventsFileURL(year int) string {
	return fmt.Sprintf("http://www.retrosheet.org/events/%deve.zip", year)
}

// GetGameLogsURL returns the game logs URL for the given year.
func GetGameLogsURL(year int) string {
	return fmt.Sprintf("http://www.retrosheet.org/gamelogs/gl%d.zip", year)
}

func main() {
	// Commandline Options
	var fromYear = flag.Int("f", 2010, "Season Year(From)")
	var toYear = flag.Int("t", 2014, "Season Year(To)")
	flag.Parse()

	MakeWorkDirectory(DirName)

	// Generate URL
	urls := []string{}
	for year := *fromYear; year <= *toYear; year++ {
		urls = append(urls, GetEventsFileURL(year))
		urls = append(urls, GetGameLogsURL(year))
	}

	// Download files (one goroutine per URL)
	wait := new(sync.WaitGroup)
	for _, url := range urls {
		wait.Add(1)
		go func(url string) {
			// Done is deferred so the counter is decremented on every return path.
			defer wait.Done()
			fullfilename := DownloadArchives(url, DirName)
			if err := Unzip(fullfilename, DirName); err != nil {
				fmt.Println(err)
				os.Exit(1)
			}
		}(url)
	}
	wait.Wait()
}
dataset.go
This script creates the dataset by running the commands of chadwick, a library dedicated to score data.
I had a hard time executing the "cwgame" and "cwevent" commands that the chadwick library provides.
Up to the point where it changes directory and runs the command (the input and output arguments are relative paths), it is the same as the Python version. However, I got confused about the path settings while writing it, and the format strings take many arguments (this is my own design mistake, but I did not know how to write the equivalent of Python's "{hoge}".format(hoge="fuga")), so I got stuck on trivial things.
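On the question of Python-style named placeholders: fmt has no direct equivalent of "{hoge}".format(hoge="fuga"), but two standard-library features come close. The sketch below shows explicit argument indexes in fmt.Sprintf (%[2]d reuses the second argument) and named fields with text/template. The function names eventCommand and gameCommand are hypothetical; the command strings mirror the constants used in dataset.go.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// eventCommand builds a cwevent command line using explicit argument
// indexes: %[1]s is cwPath, %[2]d is year, %[3]s is outputDir.
// Each value is passed once, however often it appears in the string.
func eventCommand(cwPath string, year int, outputDir string) string {
	return fmt.Sprintf(
		"%[1]s/cwevent -q -n -f 0-96 -x 0-62 -y %[2]d ./%[2]d*.EV* > %[3]s/events-%[2]d.csv",
		cwPath, year, outputDir)
}

// gameCommand builds a cwgame command line from named fields,
// which is the closest match to Python's named placeholders.
func gameCommand(cwPath string, year int, outputDir string) (string, error) {
	const text = "{{.CwPath}}/cwgame -q -n -f 0-83 -y {{.Year}} ./{{.Year}}*.EV* > {{.OutputDir}}/games-{{.Year}}.csv"
	tmpl, err := template.New("cwgame").Parse(text)
	if err != nil {
		return "", err
	}
	data := struct {
		CwPath    string
		Year      int
		OutputDir string
	}{cwPath, year, outputDir}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	fmt.Println(eventCommand("/usr/local/bin", 2010, "./csv"))

	game, err := gameCommand("/usr/local/bin", 2010, "./csv")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(game)
}
```

Explicit indexes keep everything in fmt and suit short strings; text/template is more verbose but reads closer to the Python original when there are many fields.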
dataset.go
// Copyright The Shinichi Nakagawa. All rights reserved.
// license that can be found in the LICENSE file.
package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

const (
	// ProjectRootDir : Project Root
	ProjectRootDir = "."
	// InputDirName : inputfile directory
	InputDirName = "./files"
	// OutputDirName : outputfile directory
	OutputDirName = "./csv"
	// CwPath : chadwick path
	CwPath = "/usr/local/bin"
	// CwEvent : cwevent command
	CwEvent = "%s/cwevent -q -n -f 0-96 -x 0-62 -y %d ./%d*.EV* > %s/events-%d.csv"
	// CwGame : cwgame command
	CwGame = "%s/cwgame -q -n -f 0-83 -y %d ./%d*.EV* > %s/games-%d.csv"
)

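// ParseCsv runs a chadwick command inside inputDir (output: csv file).
// rootDir is kept only so the signature stays the same for callers.
func ParseCsv(command string, rootDir string, inputDir string) {
	cmd := exec.Command("sh", "-c", command)
	// Set the working directory on the command itself. os.Chdir changes the
	// directory of the whole process, so goroutines calling it concurrently
	// could end up running their command in the wrong directory.
	cmd.Dir = inputDir
	out, err := cmd.Output()
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
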
func main() {
	// Commandline Options
	var fromYear = flag.Int("f", 2010, "Season Year(From)")
	var toYear = flag.Int("t", 2014, "Season Year(To)")
	flag.Parse()

	// path
	rootDir, err := filepath.Abs(ProjectRootDir)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	inputDir, err := filepath.Abs(InputDirName)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	outputDir := OutputDirName

	// Generate commands
	commandList := []string{}
	for year := *fromYear; year <= *toYear; year++ {
		commandList = append(commandList, fmt.Sprintf(CwEvent, CwPath, year, year, outputDir, year))
		commandList = append(commandList, fmt.Sprintf(CwGame, CwPath, year, year, outputDir, year))
	}

	// Run commands (one goroutine per command)
	wait := new(sync.WaitGroup)
	for _, command := range commandList {
		fmt.Println(command)
		wait.Add(1)
		go func(command string) {
			defer wait.Done()
			ParseCsv(command, rootDir, inputDir)
		}(command)
	}
	wait.Wait()
}
I was able to create the dataset (CSV) I wanted, so I consider it working.
There is room to improve the design and structure, but I feel it can be written more cleanly than the Python version.
By the way, I used the Go plugin for IntelliJ IDEA as my editor.
I chose it because I already pay for IntelliJ (including PyCharm) and wanted to put it to use, and it really is good. I will keep using IntelliJ for Go.
Also, since I had only just released v0.1, I gave up on doing a walk-in LT (a pity).
Next, I plan to implement loading the dataset into MySQL.
After that, I will benchmark it against the Python version, and since I have never used Go on Docker, I will use this library to study that as well.