[PYTHON] Estimate the average score of examinees at Sapporo Medical University

Sapporo Medical University Entrance Examination Results Analysis

In this article, I would like to roughly estimate the average score and standard deviation of all examinees based on the examinee data published on the Sapporo Medical University website.

1. Assumption of score distribution

The data published by the university gives the average, highest, and lowest scores of successful applicants. For the examinees as a whole, only their number is known, so I assumed that examinee scores follow a normal distribution and used the published data to estimate its parameters $\mu, \sigma$.

2. Data collection

First, the data used for the estimate are summarized in the table below. The rank of the lowest passer is 75 every year: the general entrance examination at Sapporo Medical University has a capacity of 75, and even when additional applicants are admitted, the score of the 75th-ranked passer is announced as the lowest passing score.

[Table image: 入試結果まとめ.png (summary of entrance examination results)]

3. Analysis method

The analysis method is very simple. Just solve the following simultaneous equations.

$$
\left\{
\begin{split}
\text{Percentage of successful applicants} &= \int_{\text{lowest score}}^{\infty} \frac{1}{\sqrt{2 \pi}\, \sigma} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx \\
\text{Average score of successful applicants} &= \frac{\int_{\text{lowest score}}^{\infty} \frac{x}{\sqrt{2 \pi}\, \sigma} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx}{\int_{\text{lowest score}}^{\infty} \frac{1}{\sqrt{2 \pi}\, \sigma} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx}
\end{split}
\right.
$$

A little supplementary explanation. The first equation says

$$
\text{Percentage of successful applicants} = \frac{\text{Rank of the lowest passer}}{\text{Number of examinees}} = \int_{\text{lowest score}}^{\infty} (\text{normal distribution})\, dx
$$

In other words, integrating the normal density from the lowest passing score to infinity gives the proportion of examinees who passed.

The second equation says

$$
\text{Average score of successful applicants} = \text{Expected score of a successful applicant} = \int_{\text{lowest score}}^{\infty} C \times x \times (\text{normal distribution})\, dx = \frac{\int_{\text{lowest score}}^{\infty} x \times (\text{normal distribution})\, dx}{\int_{\text{lowest score}}^{\infty} (\text{normal distribution})\, dx}
$$

Here the normalization constant $C$ is the constant that satisfies $C \int_{\text{lowest score}}^{\infty} (\text{normal distribution})\, dx = 1$, that is, $C = 1 / \int_{\text{lowest score}}^{\infty} (\text{normal distribution})\, dx$.
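As a side note, under the normality assumption both quantities can also be written in terms of the standard normal density and distribution function. This is the standard result for a normal distribution truncated from below, not something taken from the university's data. The symbols $x_{\min}$ (the lowest passing score), $\alpha$, $\varphi$ (standard normal density), and $\Phi$ (standard normal distribution function) are notation introduced here for illustration.

$$
\alpha = \frac{x_{\min} - \mu}{\sigma}, \qquad
\text{Percentage of successful applicants} = 1 - \Phi(\alpha), \qquad
\text{Average score of successful applicants} = \mu + \sigma \, \frac{\varphi(\alpha)}{1 - \Phi(\alpha)}
$$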

With two equations in two unknowns, you might think all that remains is to solve them. The problem is that the integral in the first equation has no elementary closed form in the unknowns $\mu, \sigma$, so the system cannot be solved by hand as it stands. I therefore gave up on finding a mathematically exact solution and decided to substitute many pairs of $\mu$ and $\sigma$ values and look for the ones that fit best. That is far too tedious to do by hand, so this is where Python comes in.

4. Find an approximate solution using Python

First, import the required libraries and modules.

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
from scipy import integrate
import japanize_matplotlib

Next, create the data required for forecasting.

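mu_I = [950,1000]                      # search range for mu
sigma_I = [60,90]                      # search range for sigma
year = [2018,2019,2020]
n = [321,267,281]                      # number of examinees in each year
pass_n = 75                            # rank of the lowest passer (capacity)
pass_ratio = [pass_n/i for i in n]     # proportion of successful applicants
pass_average = [1063,1073,1072]        # average score of successful applicants
worst = [1023,1029,1022]               # lowest passing score
mu_points = np.linspace(mu_I[0],mu_I[1],100)            # 100 candidate values of mu
sigma_points = np.linspace(sigma_I[0],sigma_I[1],60)    # 60 candidate values of sigma
pass_ratio_err = 0.005                 # tolerance on the pass ratio (0.5 percentage points)
pass_average_err = 1                   # tolerance on the average score (1 point)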

I also define the function needed to calculate the expected score of successful applicants.

def norm(x,mu, sigma):
    # Integrand x * pdf(x): x times the density of the normal distribution N(mu, sigma^2)
    return (x/(np.sqrt(2*np.pi)*sigma))*np.exp(-(x - mu)**2/(2*(sigma**2)))
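As a quick sanity check, the integral of this function above a cutoff, divided by the tail probability, should match the textbook mean of a normal distribution truncated from below. The sketch below assumes illustrative values for `mu`, `sigma`, and `cutoff` that are not taken from the article's results, and the helper name `x_times_pdf` is hypothetical. It mirrors `norm` rather than reusing it so that it runs on its own.

```python
import numpy as np
import scipy.stats as st
from scipy import integrate

def x_times_pdf(x, mu, sigma):
    # Same integrand as norm() above: x times the N(mu, sigma^2) density
    return x * st.norm.pdf(x, mu, sigma)

# Hypothetical values, chosen only for illustration
mu, sigma, cutoff = 970.0, 70.0, 1023.0

# Tail probability P(X >= cutoff), i.e. the model's pass ratio
tail_prob = st.norm.sf(cutoff, mu, sigma)

# Mean above the cutoff by numerical integration, as in the article
numeric_mean = integrate.quad(x_times_pdf, cutoff, np.inf, args=(mu, sigma))[0] / tail_prob

# Standard closed form for the mean of a normal truncated from below
alpha = (cutoff - mu) / sigma
closed_form_mean = mu + sigma * st.norm.pdf(alpha) / st.norm.sf(alpha)

print(numeric_mean, closed_form_mean)
```

If the two printed values agree, the integrand is doing what the second equation requires.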

Then, with the following code, each grid point $(\mu, \sigma)$ is classified into one of three groups:

- it reproduces the percentage of successful applicants to within $\pm 0.5$%, but not the average score of successful applicants to within $\pm 1$ point (yellow)
- it does not reproduce the percentage of successful applicants to within $\pm 0.5$%, but does reproduce the average score of successful applicants to within $\pm 1$ point (blue)
- it reproduces both the percentage of successful applicants to within $\pm 0.5$% and the average score of successful applicants to within $\pm 1$ point (green)

ratio_average = []
ratio_only = []
average_only = []
for i in range(len(year)):
    ratio_average.append([[],[]])
    ratio_only.append([[],[]])
    average_only.append([[],[]])
    for mu in mu_points:
        for sigma in sigma_points:
            # Model pass ratio: P(X >= lowest passing score) under N(mu, sigma^2)
            calculate_pass_ratio = 1 - st.norm.cdf(worst[i], mu, sigma)
            # Integral of x * pdf(x) above the lowest passing score
            int_pdf = integrate.quad(norm,worst[i], np.inf, args = (mu, sigma))[0]
            # Model average score of successful applicants
            calculate_pass_average = int_pdf / calculate_pass_ratio
            ratio_ok = np.abs(calculate_pass_ratio - pass_ratio[i]) < pass_ratio_err
            average_ok = np.abs(calculate_pass_average - pass_average[i]) < pass_average_err
            if ratio_ok and average_ok:
                ratio_average[i][0].append(mu)
                ratio_average[i][1].append(sigma)
            elif ratio_ok:
                ratio_only[i][0].append(mu)
                ratio_only[i][1].append(sigma)
            elif average_ok:
                average_only[i][0].append(mu)
                average_only[i][1].append(sigma)
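The same grid can also be used to pick a single best-fitting point per year instead of a region. The sketch below is not the article's method: it assumes the standard closed form for the mean of a truncated normal in place of numerical integration, and it assumes a scoring rule (squared errors scaled by the article's two tolerances) that is introduced here only for illustration. The name `best_fit` is hypothetical.

```python
import numpy as np
import scipy.stats as st

# Data as given in the article's setup code
year = [2018, 2019, 2020]
n = [321, 267, 281]
pass_n = 75
pass_average = [1063, 1073, 1072]
worst = [1023, 1029, 1022]

# Same grid ranges and resolution as the article
mu_grid, sigma_grid = np.meshgrid(np.linspace(950, 1000, 100),
                                  np.linspace(60, 90, 60), indexing="ij")

best_fit = {}
for y, n_i, avg_i, worst_i in zip(year, n, pass_average, worst):
    alpha = (worst_i - mu_grid) / sigma_grid            # standardized cutoff
    ratio = st.norm.sf(alpha)                           # model pass ratio
    mean = mu_grid + sigma_grid * st.norm.pdf(alpha) / ratio  # model mean of passers
    # Assumed scoring rule: errors scaled by the tolerances 0.005 and 1 point
    score = ((ratio - pass_n / n_i) / 0.005) ** 2 + ((mean - avg_i) / 1.0) ** 2
    idx = np.unravel_index(np.argmin(score), score.shape)
    best_fit[y] = (float(mu_grid[idx]), float(sigma_grid[idx]))
    print(y, best_fit[y])
```

Because every grid point is evaluated at once with NumPy arrays, this runs much faster than calling `integrate.quad` inside nested loops, at the cost of relying on the closed-form expression.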

Finally, the classified points were color coded and plotted on the graph.

fig , axes = plt.subplots(1,3,figsize = (18,5))
for i, ax in zip([0,1,2],axes):
    ax.scatter(ratio_only[i][0],ratio_only[i][1],c = 'y', s = 2, label= 'Percentage of successful applicants: {:.1f} $\\pm$ {}%'.format(pass_ratio[i]*100, pass_ratio_err*100))
    ax.scatter(average_only[i][0],average_only[i][1],c = 'b', s = 2,label = 'Average score of successful applicants: {} $\\pm$ {} points'.format(pass_average[i], pass_average_err))
    ax.scatter(ratio_average[i][0],ratio_average[i][1],c = 'g', s = 2,  label = 'Both of the above conditions satisfied')
    ax.set_xlim(mu_I[0], mu_I[1])
    ax.set_ylim(sigma_I[0], sigma_I[1])
    ax.set_xlabel('$\\mu$')
    ax.set_ylabel('$\\sigma$')
    ax.legend(loc = 'best')
    ax.set_title('FY{}'.format(year[i]))
plt.show()

The execution result is as shown in the graph below.

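[Figure: 予測結果グラフ.png (graph of the estimation results, one panel per year, $\mu$ on the horizontal axis and $\sigma$ on the vertical axis)]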

The table below shows the approximate values of $\mu$ and $\sigma$ read from the green region of the graph.
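As a cross-check on values read off a graph, the two equations can also be inverted numerically under the same normality assumption. The sketch below is not part of the original analysis. It relies on the standard fact that a normal distribution truncated from below at a standardized cutoff $\alpha$ has mean $\mu + \sigma \varphi(\alpha) / (1 - \Phi(\alpha))$, and on `scipy.stats.norm.isf` to find $\alpha$ from the pass ratio. The function name `solve_mu_sigma` is hypothetical.

```python
import scipy.stats as st

def solve_mu_sigma(n_examinees, n_pass, pass_avg, lowest):
    """Solve the two equations for (mu, sigma), assuming normally distributed scores."""
    p = n_pass / n_examinees           # proportion of successful applicants
    alpha = st.norm.isf(p)             # standardized cutoff: 1 - Phi(alpha) = p
    lam = st.norm.pdf(alpha) / p       # (mean of passers - mu) / sigma
    # lowest = mu + alpha * sigma and pass_avg = mu + lam * sigma
    sigma = (pass_avg - lowest) / (lam - alpha)
    mu = lowest - alpha * sigma
    return mu, sigma

# Data as given in the article's setup code
for y, n_i, avg_i, worst_i in zip([2018, 2019, 2020],
                                  [321, 267, 281],
                                  [1063, 1073, 1072],
                                  [1023, 1029, 1022]):
    mu, sigma = solve_mu_sigma(n_i, 75, avg_i, worst_i)
    print(y, round(mu, 1), round(sigma, 1))
```

The printed pairs should fall inside the green regions of the corresponding panels if the grid search and this inversion agree.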

[Table image: 予測結果.png (estimated values of $\mu$ and $\sigma$ for each year)]

5. Conclusion

This year (FY2020), the average score of successful applicants was higher than in FY2018, but the estimated average score of all examinees was lower than in FY2018. In addition, the estimated standard deviation is increasing year by year, which suggests that the questions are being set so as to widen the score gap among examinees.
