[Baseball Hack] I tried porting a Python script for fetching MLB score and stats data to Go in half a day

What I did

A little over a month ago, I made the MLB baseball data script [py-retrosheet](https://github.com/Shinichi-Nakagawa/py-retrosheet "py-retrosheet") compatible with Python 3.5. This time I ported it to golang by transcribing it.

go-retrosheet

It is a script that downloads game data and results (mainly plate-appearance data) from Retrosheet (http://www.retrosheet.org/), parses them into a form that can be used for aggregation and analysis, and writes them out as CSV.

The original also has a feature that loads the data into a database (MySQL, etc.), but I will build that later (reason: I don't want to cut into my sleep time ... lol).

This is beside the point, but the name "Go! Retrosheet" has a lot of momentum and sounds cool (said in a monotone).

Why I decided to make it with Golang

There are two reasons.

At the last #gocon, I thought "I'm attending GoCon even though I've never written Golang, so let's make something with Golang and jump into an LT!", and with that motivation [I did A Tour of Go in half a day, rewrote a Django application with Martini, and gave a walk-in LT](http://shinyorke.hatenablog.com/entry/2015/06/21/195656 "I started Golang, wrote a baseball app half a day later, and gave an LT ~ Go Conference 2015 summer Report #gocon"). But thinking back, I hadn't written any Golang since then (PyCon happened in between...), so I started this with the feeling of "let's write something to commemorate attending the winter #gocon!" I was also thinking of giving an LT if it was done in time.

Also, as a more serious reason:

I don't know Go well enough to talk about it (or rather, I'm a complete beginner), so I wanted to reconfirm Go's strengths and characteristics by transcribing existing code. In other words, this was also meant as "learning".

What I made

The code passes gofmt and golint, but there are few comments overall, and the ones that exist may be sloppy ... Experts, I'm waiting for your opinions m(_ _)m

Also, I haven't written any tests, so it is thoroughly legacy code.

download.go

This is the code that downloads the archives (zip files) containing Retrosheet's CSV data and decompresses them under the files directory.

I wrote the unzip code with reference to here, and thankfully it works properly! But I wondered whether it could be written more smartly.

Also, this was my first time writing concurrent processing in Go, and I was surprised at how cleanly it could be written.

download.go


// Copyright  The Shinichi Nakagawa. All rights reserved.
// license that can be found in the LICENSE file.

package main

import (
	"archive/zip"
	"flag"
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"os"
	"path"
	"path/filepath"
	"sync"
)

const (
	// DirName is the directory where downloaded files are saved
	DirName = "files"
)

// IsExist reports whether the file exists
func IsExist(filename string) bool {

	// Exists check
	_, err := os.Stat(filename)
	return err == nil
}

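// MakeWorkDirectory creates the work directory if it does not exist yet
func MakeWorkDirectory(dirname string) {
	if IsExist(dirname) {
		return
	}
	// Report a failure instead of silently ignoring it.
	if err := os.MkdirAll(dirname, 0777); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}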

// DownloadArchives downloads a Retrosheet archive into dirname and returns the saved path
func DownloadArchives(url string, dirname string) string {

	// get archives
	fmt.Printf("download: %s\n", url)
	response, err := http.Get(url)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	// Close the body so the connection is released.
	defer response.Body.Close()
	fmt.Printf("status: %s\n", response.Status)

	// download
	body, err := ioutil.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	_, filename := path.Split(url)
	fmt.Println(filename)
	fullfilename := filepath.Join(dirname, filename)
	// O_TRUNC so a re-download does not leave stale bytes from a longer old file.
	file, err := os.OpenFile(fullfilename, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0777)
	if err != nil {
		// file is nil here, so stop rather than fall through to Write.
		fmt.Println(err)
		os.Exit(1)
	}
	defer file.Close()

	if _, err := file.Write(body); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	return fullfilename

}

// Unzip extracts every entry of the archive into outputdirectory
func Unzip(fullfilename string, outputdirectory string) error {
	r, err := zip.OpenReader(fullfilename)
	if err != nil {
		return err
	}
	defer r.Close()
	for _, f := range r.File {
		path := filepath.Join(outputdirectory, f.Name)
		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(path, f.Mode()); err != nil {
				return err
			}
			continue
		}
		// Extract in a helper so each file is closed per entry,
		// not held open until Unzip returns.
		if err := extractFile(f, path); err != nil {
			return err
		}
	}
	return nil
}

// extractFile writes a single zip entry to path
func extractFile(f *zip.File, path string) error {
	rc, err := f.Open()
	if err != nil {
		return err
	}
	defer rc.Close()

	out, err := os.OpenFile(
		path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, f.Mode())
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, rc)
	return err
}

// GetEventsFileURL returns the events file URL for the given year
func GetEventsFileURL(year int) string {
	return fmt.Sprintf("http://www.retrosheet.org/events/%deve.zip", year)
}

// GetGameLogsURL returns the game logs URL for the given year
func GetGameLogsURL(year int) string {
	return fmt.Sprintf("http://www.retrosheet.org/gamelogs/gl%d.zip", year)
}

func main() {
	// Commandline Options
	var fromYear = flag.Int("f", 2010, "Season Year(From)")
	var toYear = flag.Int("t", 2014, "Season Year(To)")
	flag.Parse()

	MakeWorkDirectory(DirName)

	wait := new(sync.WaitGroup)
	// Generate URL
	urls := []string{}
	for year := *fromYear; year < *toYear+1; year++ {
		urls = append(urls, GetEventsFileURL(year))
		wait.Add(1)
		urls = append(urls, GetGameLogsURL(year))
		wait.Add(1)
	}

	// Download files
	for _, url := range urls {
		go func(url string) {
			fullfilename := DownloadArchives(url, DirName)
			err := Unzip(fullfilename, DirName)
			if err != nil {
				fmt.Println(err)
				os.Exit(1)
			}
			wait.Done()
		}(url)
	}
	wait.Wait()

}

dataset.go

It is a script that creates the dataset by invoking the commands of chadwick, a library dedicated to score data.

I had a hard time executing the "cwgame" and "cwevent" commands provided by the chadwick library.

Up to the point of changing directory before executing the command (the in/out arguments are relative paths), it is the same as the Python version. But I got confused about the path settings while writing it, and the format strings take a lot of arguments (this is my own design mistake, but I didn't know how to write the equivalent of Python's "{hoge}".format(hoge="fuga")), so I got stuck on trivial things.
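
For the named-placeholder problem mentioned above, one option in the standard library is text/template, which fills fields by name much like Python's str.format. This is only a sketch under assumptions: the struct `cwArgs` and the helper `renderCommand` are hypothetical names, and the template string is a rewording of the cwgame command used in this post.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// cwArgs is a hypothetical struct for illustration; it is not part of go-retrosheet.
type cwArgs struct {
	CwPath string
	Year   int
	OutDir string
}

// renderCommand fills the named placeholders in tmpl from args.
func renderCommand(tmpl string, args cwArgs) (string, error) {
	t, err := template.New("cmd").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, args); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	const cwGame = "{{.CwPath}}/cwgame -q -n -f 0-83 -y {{.Year}} ./{{.Year}}*.EV* > {{.OutDir}}/games-{{.Year}}.csv"
	cmd, err := renderCommand(cwGame, cwArgs{CwPath: "/usr/local/bin", Year: 2010, OutDir: "./csv"})
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Println(cmd)
}
```

Each value is written once in the struct, so the year no longer has to be repeated positionally in the argument list.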

dataset.go


// Copyright  The Shinichi Nakagawa. All rights reserved.
// license that can be found in the LICENSE file.

package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

const (
	// ProjectRootDir : Project Root
	ProjectRootDir = "."
	// InputDirName : inputfile directory
	InputDirName = "./files"
	// OutputDirName : outputfile directory
	OutputDirName = "./csv"
	// CwPath : chadwick path
	CwPath = "/usr/local/bin"
	// CwEvent : cwevent command
	CwEvent = "%s/cwevent -q -n -f 0-96 -x 0-62 -y %d ./%d*.EV* > %s/events-%d.csv"
	// CwGame : cwgame command
	CwGame = "%s/cwgame -q -n -f 0-83 -y %d ./%d*.EV* > %s/games-%d.csv"
)

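// ParseCsv runs a chadwick command in inputDir to convert event files to CSV
func ParseCsv(command string, rootDir string, inputDir string) {
	// Set the working directory on the command itself. os.Chdir is
	// process-wide and races when several goroutines call it concurrently.
	cmd := exec.Command("sh", "-c", command)
	cmd.Dir = inputDir
	out, err := cmd.Output()

	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Println(string(out))

}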

func main() {
	// Commandline Options
	var fromYear = flag.Int("f", 2010, "Season Year(From)")
	var toYear = flag.Int("t", 2014, "Season Year(To)")
	flag.Parse()

	// path
	rootDir, _ := filepath.Abs(ProjectRootDir)
	inputDir, _ := filepath.Abs(InputDirName)
	outputDir := OutputDirName

	wait := new(sync.WaitGroup)
	// Generate commands
	commandList := []string{}
	for year := *fromYear; year < *toYear+1; year++ {
		commandList = append(commandList, fmt.Sprintf(CwEvent, CwPath, year, year, outputDir, year))
		wait.Add(1)
		commandList = append(commandList, fmt.Sprintf(CwGame, CwPath, year, year, outputDir, year))
		wait.Add(1)
	}
	for _, command := range commandList {
		fmt.Println(command)
		go func(command string) {
			ParseCsv(command, rootDir, inputDir)
			wait.Done()
		}(command)
	}
	wait.Wait()

}

Achievements and impressions

I can now create the dataset (CSV) the way I wanted, so I think it went well.

There is room for improvement in the design and structure, but I feel it can be written more cleanly than the Python version.

By the way, for the editor I used the IntelliJ IDEA Go plugin.

I pay for it (PyCharm included), so I used it with the intention of getting my money's worth, and it really is good. I'll keep using IntelliJ for Go.

Also, since it felt like I had only just released v0.1, I decided not to give a walk-in LT (a pity).

Homework

I'm going to implement loading the dataset into MySQL.

After that, I want to benchmark it against the Python version, and since I have never used Go on Docker, I will use this library to study that.
