Finding out with Python and Stan at what age baseball players hit the most home runs

Overview

There are many things to worry about in the world, but to my mind the most pressing is: "At what age do baseball players hit the most home runs?" In this article, I use Bayesian estimation on Major League Baseball data to estimate the home run rate at each age. The official definition is different, but for simplicity we define home run rate = number of home runs / number of at-bats. In other words, the home run rate is the probability that a player hits a home run in a given at-bat (walks and the like excluded).
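To pin down the definition with numbers: a hypothetical player with 600 at-bats and 18 home runs has a home run rate of 0.03. A trivial sketch (the figures are made up for illustration):

```python
# Home run rate as defined in this article: home runs / at-bats
# (a simplification, not the official sabermetric definition)
def home_run_rate(hr, ab):
    return hr / ab

# Hypothetical example: 18 HR in 600 AB
rate = home_run_rate(18, 600)
print(rate)  # 0.03
```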

The tools used are Python (with pandas and matplotlib) and Stan (via PyStan).

The intended readers are people interested in, or studying, some combination of baseball, Bayesian statistics, Python, and Stan (which is almost no one; conversely, if none of those apply, you don't need to read this article, I fear). I therefore won't explain each topic in detail, but I will list good references where appropriate.

Disclaimer

I'm a layman at statistics, Bayesian inference, and Stan alike, so what follows may contain all kinds of mistakes, from the fundamental to the subtle. Please bear with me(?).

Data acquisition

Lahman's Database is an overwhelming site that publishes the records (at-bats, home runs, and so on) of every Major League Baseball player from roughly the last 100 years, so I will use it. This time, we download the 2016 version of the csv files ("Download 2016 Version" -> "2016 – comma-delimited version").

If you unzip the downloaded file, the seemingly endless csv files may sap your motivation, but this time I will use only two:

- Master.csv (basic player information)
- Batting.csv (batting results)

Data preprocessing

`import` the libraries to be used, read the csv files, and load them into a `pandas.DataFrame`. pandas is an insanely convenient, Excel-like library for Python (roughly speaking), and it is very useful for holding data in table form and performing fine-grained operations on it.

import numpy as np
import pandas as pd
import pystan
import matplotlib.pyplot as plt

bat = pd.read_csv('Batting.csv')
mas = pd.read_csv('Master.csv')

Each row of Batting.csv is one player's batting record for one year. Let's see what it looks like.

bat.tail()
(Screenshot: output of `bat.tail()`)

I see, this looks good. The various stats are keyed to the year and the player's id. However, what I want to know this time is the relationship between age and performance, so I need to add an age column. Conveniently, Master.csv pairs each player's id with their year of birth, so I will use that. Let's take a look at the shape of Master.csv.

mas.tail()
(Screenshot: output of `mas.tail()`)

OK. Merge the two DataFrames using the player's id as the key, then compute the player's age by subtracting the year of birth from the season year. One wrinkle: in Major League Baseball, a player's official age for a season is their age as of June 30, so for players born in July or later we treat the year of birth as the following year.

mas['birthYear'] = mas['birthYear'] + (mas['birthMonth'] >= 7)

bat = pd.merge(bat, mas[['playerID', 'birthYear']], on='playerID')
bat['age'] = bat['yearID'] - bat['birthYear']
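As a quick sanity check of the June 30 rule, here is the same adjustment applied to a toy DataFrame with two hypothetical players born in the same year, on either side of the cutoff (the values are made up; only the `birthYear`/`birthMonth` logic matches the code above):

```python
import pandas as pd

# Two hypothetical players born in 1980: one in June, one in July
toy = pd.DataFrame({'birthYear': [1980, 1980], 'birthMonth': [6, 7]})

# Born in July or later -> treated as born the following year
toy['birthYear'] = toy['birthYear'] + (toy['birthMonth'] >= 7)

# Official age for the 2010 season (as of June 30)
toy['age'] = 2010 - toy['birthYear']
print(toy['age'].tolist())  # [30, 29]
```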

By the way, there are a couple of books that cover tips on baseball statistics, such as the age convention above, together with various analysis results.

Now let's analyze. It seems unwise to use data that is too old, or data from players with only brief careers, so I will cut those from the data frame. Specifically, we will use only players born after 1965 with more than 1000 career plate appearances. To do this, we first compute plate appearances (PA) and then use groupby to get each player's career total.

bat['PA'] = bat['AB'] + bat['BB'] + bat['HBP'] + bat['SH'] + bat['SF']

pa_c = bat.groupby('playerID').sum()[['PA']]
pa_c.columns = ['PA_Career']
bat = pd.merge(bat, pa_c, left_on='playerID', right_index=True)

# Keep only players born after 1965 with more than 1000 career PA
bat = bat[(bat['birthYear'] > 1965) & (bat['PA_Career'] > 1000)]

Let's take a look at the DataFrame once.

bat.head()
(Screenshot: output of `bat.head()`)

A column with the player's age has been added on the right, which is nice. Now, let's get to the actual analysis.

Model 0 (mean) & why Bayesian estimation is needed

Hereafter, we will start with a simple model and gradually update it to a good feeling.

First, I will simply sum the at-bats and home runs of all players at each age and divide. For example: if the 30-year-old players in the data total 100,000 at-bats and 3,000 home runs, the home run rate at age 30 is 3000/100000 = 0.03. This kind of computation is very easy with pandas' groupby feature.

HRdf = bat.groupby('age').sum()[['AB', 'HR']]
HRdf['HRR'] = HRdf['HR'] / HRdf['AB']

fig, ax = plt.subplots()

HRdf['HRR'].plot(kind='line', style='o-', c='black', ax=ax)
ax.set_ylabel('p')

(Figure: home run rate by age, simple average)

The horizontal axis is age and the vertical axis is home run rate. I see. The home run rate rises slowly from age 19 and peaks at age 29, at 0.034, i.e. about 3.4%. It then declines slowly, and drops off sharply after age 40.

We could stop here, since all we wanted was the home run rate at each age, but there are problems with treating the graph above as the "true home run rate".

(a). **Individual differences are not taken into account.** In this figure, 37- and 38-year-old players appear to hit home runs about as well as players in their late 20s. Is that really true? Thinking about it calmly, this is probably because batters with poor results retire by their late 30s, leaving only the batters who can still hit home runs. Those batters presumably hit even more home runs in their late 20s, but the figure cannot show that, because averaging over all batters erases individual differences.

(b). **There is a lot of noise.** In the figure, 39-year-old batters hit unusually many home runs, but that can hardly mean a player's home run ability suddenly jumps at age 39; I can't think of any reason it would. Thinking about it normally, the true ability of a 39-year-old should be about the same as, or slightly below, that of a 37- or 38-year-old.

(c). **The reliability of the data points is not visible.** In the data above, the sample sizes for 30-year-old batters and 45-year-old batters are completely different. With a simple overall average like this, the reliability of a value grows with the number of samples, so the 45-year-old point is barely trustworthy (it is unlikely to be close to the true value). Yet it sits on the graph with the same apparent authority as the 30-year-old point, which feels wrong.
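Point (c) can be made quantitative with the standard error of a binomial proportion, $\sqrt{p(1-p)/n}$. The sample sizes below are assumptions for illustration, not the actual counts in the data:

```python
import math

def binom_se(p, n):
    """Standard error of an estimated binomial proportion."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical sample sizes: many at-bats at age 30, few at age 45
print(binom_se(0.03, 100_000))  # ~0.0005
print(binom_se(0.03, 500))      # ~0.008, roughly 14x wider
```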

For the above reasons, I want to do statistical modeling. The above is my (amateur) take; to learn the advantages and basic ideas of statistical modeling properly, read, for example, the following book.

-[Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayes Models, MCMC (Kubo)](https://www.amazon.co.jp/dp/400006973X)

Model 1 (Hierarchical Bayes considering age)

Let's make the following model.

HR[i] \sim \mathrm{Binomial}(AB[i], p[i]), \\ p[i] = \mathrm{Logistic}(\beta + r_{age}[age[i]]), \\ r_{age}[j] \sim \mathrm{Normal}(0, s_{age})

That was an abrupt wall of notation, so let me explain in words. Suppose a batter's home run count in a season follows a binomial distribution determined by the number of at-bats $AB$ and the home run rate $p$. Suppose further that $p$ is given by the Logistic function applied to the sum of a constant $\beta$ and a variable $r_{age}$ determined by the player's age. Here, the value of $r_{age}$ for each age follows a normal distribution with mean 0 and standard deviation $s_{age}$. From here on, we use PyStan, the Python interface to the free MCMC software Stan, to estimate each parameter by Bayesian inference.
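The Logistic function here is the inverse-logit, $1/(1+e^{-x})$: it maps any real number into $(0, 1)$, so $\beta + r_{age}$ can roam freely on the log-odds scale while $p$ stays a valid probability. A quick numeric sketch (the $\beta$ value is an assumption, chosen only to land near a ~3% rate):

```python
import numpy as np

def inv_logit(x):
    """Logistic function: maps the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# A baseline beta around -3.5 corresponds to roughly a 3% home run rate;
# r_age then shifts the rate up or down on the log-odds scale
print(inv_logit(-3.5))        # ~0.029
print(inv_logit(-3.5 + 0.3))  # ~0.039
```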

As for what Bayesian inference is, what MCMC is, how to use Stan, what a probability distribution is, why people live in the first place, how human consciousness arose, and whether AI can be conscious: I won't explain any of that here, since it would get completely out of hand, and plenty of people more qualified than me have already written explanations.

For relatively readable books in Japanese: the basics of Bayesian estimation and statistical modeling are covered by the aforementioned ["Introduction to Statistical Modeling for Data Analysis"](https://www.amazon.co.jp/dp/400006973X), and for Stan, reading ["Bayesian Statistical Modeling with Stan and R"](https://www.amazon.co.jp/exec/obidos/ASIN/4320112423/statmodeling-22/) will get you there.

The code for PyStan is below.

AB  = bat['AB'].values
HR  = bat['HR'].values

age_map = {age: idx+1 for idx, age in enumerate(np.unique(bat['age'].values))}
age = bat['age'].map(age_map).values.astype(int)

data = dict(N=len(AB), N_age=len(np.unique(age)), HR=HR, AB=AB, age=age)

model_code = """
data {
    int N;
    int N_age;
    int HR[N];
    int AB[N];
    int age[N];
}

parameters {
    real beta;
    real<lower=0> s_age;
    vector[N_age] r_age;
}

transformed parameters {
    vector[N] p;
    
    for (i in 1:N)
        p[i] = inv_logit(beta + r_age[age[i]]);
}

model {
    r_age ~ normal(0, s_age);
    HR ~ binomial(AB, p);
}

generated quantities {
    real p_age[N_age];
    for (i in 1:N_age)
        p_age[i] = inv_logit(beta + r_age[i]);
}
"""

fit = pystan.stan(model_code=model_code, data=data, chains=4, iter=1000)

If you run the above code, the machine grinds away and produces the posterior distribution of each parameter ($\beta$, $s_{age}$, $r_{age}$). In Bayesian estimation, parameters are obtained not as single fixed values but as distributions, which lets us express the uncertainty in the data. For example, the age-45 data has fewer samples than the age-30 data and is more uncertain, so its parameters come out with wider distributions. In other words, problem (c) of the simple-averaging analysis above is addressed.

What I ultimately want is the home run rate $p$ at each age, so I plot it together with $r_{age}$ using the code below.

def plot_r_p(fit):
    inv_age_map = {v: k for k, v in age_map.items()}
    x_age = [inv_age_map[a] for a in np.unique(age)]
    
    r_age_trace = fit.extract()['r_age']
    p_age_trace = fit.extract()['p_age']
    
    # Compute the median and the 50% / 95% interval endpoints (as percentiles)
    percents = [2.5, 25, 50, 75, 97.5]
    
    r_age = []
    p_age = []
    for percent in percents:
        r_age.append(np.percentile(r_age_trace, percent, axis=0))
        p_age.append(np.percentile(p_age_trace, percent, axis=0))
    
    fig, axs = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(12, 4))

    # axs[0] : r_age
    axs[0].plot(x_age, r_age[2], 'o-', color='blue')                        # median
    axs[0].fill_between(x_age, r_age[0], r_age[4], alpha=0.1, color='blue') # 95% interval
    axs[0].fill_between(x_age, r_age[1], r_age[3], alpha=0.3, color='blue') # 50% interval
    axs[0].set(xlabel='age', ylabel=r'$r_{age}$')

    # axs[1] : p_age
    axs[1].plot(x_age, p_age[2], 'o-', color='red')
    axs[1].fill_between(x_age, p_age[0], p_age[4], alpha=0.1, color='red')
    axs[1].fill_between(x_age, p_age[1], p_age[3], alpha=0.3, color='red')
    axs[1].set(xlabel='age', ylabel=r'$p_{age}$')
    
    return fig, axs

fig, axs = plot_r_p(fit)
HRdf['HRR'].plot(kind='line', style='o-', c='black', ax=axs[1])

(Figure: Model 1 results, r_age on the left and p_age on the right)

The blue data on the left is $r_{age}$; on the right, the black line is the simple average shown earlier and the red line is the estimated age-specific home run rate $p_{age}$. In both panels, the lightly shaded band is the 95% interval and the darker band is the 50% interval.

Looking at the results... hmm. I think problem (c) is partially solved, but nothing has been done about (a) individual differences or (b) noise. For example, ability still spikes at age 39 as before, and a new problem appears: ability seems to improve after the 40s, where there are few data points, which is an unrealistic result (probably because the likelihood is higher when those sparse points are pulled toward the all-age average).

Model 2 (considering time series)

To address these problems, we bring in the time series. What this means: the previous model treated each age's ability as completely independent (corresponding to the line in Model 1's equations where $r_{age}$ follows a normal distribution with mean 0). But if you think about it, ability at age $j$ should be close to ability at age $j-1$. Reflecting that in the model gives the following.

HR[i] \sim \mathrm{Binomial}(AB[i], p[i]), \\ p[i] = \mathrm{Logistic}(\beta + r_{age}[age[i]]), \\ r_{age}[j] \sim \mathrm{Normal}(r_{age}[j-1], s_{age}), \\ \sum_{j} r_{age}[j] = 0

In this model, the mean of the normal distribution that $r_{age}[j]$ follows is $r_{age}[j-1]$. This represents the assumption that "ability at age $j$ should be close to ability at age $j-1$".
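The $r_{age}$ prior in Model 2 is a first-order random walk. A small simulation shows the kind of curves it favors compared with Model 1's independent prior: neighboring ages stay close by construction (the $s_{age}$ value here is assumed, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
s_age = 0.1   # assumed scale, for illustration only
n_age = 28    # roughly covers ages 19-46

# Model 1 prior: each r_age[j] drawn independently around 0
independent = rng.normal(0, s_age, size=n_age)

# Model 2 prior: r_age[j] drawn around r_age[j-1], i.e. a random walk
steps = rng.normal(0, s_age, size=n_age)
random_walk = np.cumsum(steps)

# Under the random walk, adjacent ages differ by one Normal(0, s_age) step
print(np.abs(np.diff(random_walk)).max())
```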

AB  = bat['AB'].values
HR  = bat['HR'].values

age_map = {age: idx+1 for idx, age in enumerate(np.unique(bat['age'].values))}
age = bat['age'].map(age_map).values.astype(int)

data = dict(N=len(AB), N_age=len(np.unique(age)), HR=HR, AB=AB, age=age)

model_code = """
data {
    int N;
    int N_age;
    int HR[N];
    int AB[N];
    int age[N];
}

parameters {
    real beta;
    real<lower=0> s_age;
    vector[N_age] r_age;
}

transformed parameters {
    vector[N] p;
    
    for (i in 1:N)
        p[i] = inv_logit(beta + r_age[age[i]]);
}

model {
    r_age[1] ~ normal(-sum(r_age[2:N_age]), 0.001);
    r_age[2:N_age] ~ normal(r_age[1:(N_age-1)], s_age);
    HR ~ binomial(AB, p);
}

generated quantities {
    real p_age[N_age];
    for (i in 1:N_age)
        p_age[i] = inv_logit(beta + r_age[i]);
}
"""

fit2 = pystan.stan(model_code=model_code, data=data, chains=4, iter=1000)

The only changes from Model 1 are in the `model` block: `r_age[2:N_age] ~ normal(r_age[1:(N_age-1)], s_age)` implements the time series, and `r_age[1] ~ normal(-sum(r_age[2:N_age]), 0.001)` softly constrains the sum of $r_{age}$ to 0 (I admit this latter part feels like a bit of a hack...).
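For what it's worth, the point of pinning $\sum_j r_{age}[j]$ to 0 is identifiability: adding a constant $c$ to $\beta$ and subtracting it from every $r_{age}[j]$ leaves every $p[i]$ unchanged, so without the constraint the posterior can drift along that direction. A tiny sketch of the ambiguity (all values are made up):

```python
import numpy as np

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = -3.5                              # hypothetical baseline
r_age = np.array([0.4, -0.1, 0.2, 0.3])  # hypothetical age effects
c = 10.0                                 # arbitrary shift

# Shifting beta up and r_age down by the same constant leaves p unchanged...
p1 = inv_logit(beta + r_age)
p2 = inv_logit((beta + c) + (r_age - c))
print(np.allclose(p1, p2))  # True

# ...so we remove the ambiguity by forcing sum(r_age) = 0 (here via centering)
centered = r_age - r_age.mean()
print(centered.sum())  # ~0.0
```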

The plot of the result looks like this:

(Figure: Model 2 results)

I see. There are two improvements over Model 1:

- The jump in ability at age 39 is somewhat suppressed.
- Ability in the 40s now declines year by year.

Both are effects of accounting for the time series, and both look like more realistic results. So the noise problem (b) has improved a little. Finally, let's build a model that accounts for individual differences.

Model 3 (considering individual differences)

So, we add a term $r_{id}$ to the model representing individual differences. This is a variable for each player's ability relative to the average, and we assume it follows a normal distribution with mean 0.  $HR[i] \sim \mathrm{Binomial}(AB[i], p[i]), \\ p[i] = \mathrm{Logistic}(\beta + r_{age}[age[i]] + r_{id}[id[i]]), \\ r_{age}[j] \sim \mathrm{Normal}(r_{age}[j-1], s_{age}), \\ \sum_{j} r_{age}[j] = 0, \\ r_{id}[k] \sim \mathrm{Normal}(0, s_{id})$

id_map = {id: idx+1 for idx, id in enumerate(np.unique(bat['playerID'].values))}
id = bat['playerID'].map(id_map).values.astype(int)

data = dict(N=len(AB), N_age=len(np.unique(age)), N_id=len(np.unique(id)), HR=HR, AB=AB, id=id, age=age)

model_code = """
data {
    int N;
    int N_id;
    int N_age;
    int HR[N];
    int AB[N];
    int id[N];
    int age[N];
}

parameters {
    real beta;
    real<lower=0> s_age;
    real<lower=0> s_id;
    vector[N_age] r_age;
    vector[N_id] r_id;
}

transformed parameters {
    vector[N] p;
    
    for (i in 1:N)
        p[i] = inv_logit(beta + r_age[age[i]] + r_id[id[i]]);
}

model {
    r_id ~ normal(0, s_id);
    r_age[1] ~ normal(-sum(r_age[2:N_age]), 0.001);
    r_age[2:N_age] ~ normal(r_age[1:(N_age-1)], s_age);
    HR ~ binomial(AB, p);
}

generated quantities {
    real p_age[N_age];
    for (i in 1:N_age)
        p_age[i] = inv_logit(beta + r_age[i]);
}
"""

fit3 = pystan.stan(model_code=model_code, data=data, chains=4, iter=1000)

Let's see the result.

(Figure: Model 3 results)

I see! Isn't this a pretty good result? Ability changes smoothly with age and looks quite reasonable. Note that the estimated $p_{age}$ is much smaller than the measured values; I believe this is because the estimate represents the $p_{age}$ of an *average* player at each age. In the measured values, better players get more at-bats, so the overall average is weighted toward the great players, and the result comes out higher than the average player's $p_{age}$.

The bottom line is that 29-year-olds are most likely to hit home runs, at just over 2.5% for the average player. Looking at the change in ability around age 29, the growth before 29 is steeper than the decline after it. Also, $p_{age}$ around age 20 is a little over 1.5%, so on average a player around 30 can be expected to hit roughly $(2.5 / 1.5) \simeq 1.7$ times as many home runs as a player around 20.

I think there are still various things I could try, but I'll stop around here for now...

Summary

So: as a result of Bayesian estimation with PyStan of the age at which baseball players hit the most home runs, it turned out that players hit the most home runs at age 29.

That's all. Thank you very much.

Bonus: I did the same analysis for other stats, but listing them here feels out of scope for Qiita, so they are in a [separate post](http://hoture6.hatenablog.com/entry/2017/06/08/212256). If you are interested, please have a look.
