[PYTHON] Finding the "minimum passing score" of an entrance exam from the "average score of examinees", "average score of successful applicants", and "competition ratio"

Purpose

When working through past questions of entrance exams for junior high school, high school, university, and so on, the "average score of examinees", the "average score of successful applicants", and the "competition ratio" (number of examinees / number of successful applicants) are usually disclosed, but the "minimum passing score" often is not. Here, we consider predicting the minimum passing score.

Method

I used the entrance examination data (2013-2019) of M junior high school published on the following site: Gorogoro junior high school entrance exam, https://www.goro-goro.net/2019-musashi

For each year's data, the following steps were carried out, writing $\mu_a$ for the average score of examinees, $\mu_p$ for the average score of successful applicants, and $r$ for the competition ratio. (Each step is explained in detail in the next section.)

  1. Choose a value of $\sigma$, and assume that examinees' scores follow a normal distribution with mean $\mu_a$ and standard deviation $\sigma$.
  2. In this normal distribution, take the $b$ such that the area of the region $x > b$ equals $\frac{1}{r}$ as the provisional minimum passing score.
  3. Compute the mean of the $x > b$ part of the distribution, and take it as the provisional average score of successful applicants ${\mu_p}^{\prime}$.
  4. Update the standard deviation $\sigma$ according to whether ${\mu_p}^{\prime}$ is larger or smaller than $\mu_p$, then return to step 2. Repeat.
  5. When ${\mu_p}^{\prime} = \mu_p$, take the minimum passing score $b$ at that point as the predicted value.

The predicted minimum passing scores were then plotted against the actual values, and the accuracy was compared with a simple baseline: the prediction that minimum passing score = (average score of examinees + average score of successful applicants) / 2.

Details of the method

1. Choose a value of $\sigma$, and assume that examinees' scores follow a normal distribution with mean $\mu_a$ and standard deviation $\sigma$.

This assumption seems reasonable when the number of examinees is large enough and no examinee scores full marks or zero. The true score distribution is of course discrete, but we treat it as continuous here. The initial value of $\sigma$ can be chosen arbitrarily; it is updated later so that it approaches the optimal value. If the distribution is read as the probability distribution of a single examinee's score, the $y$ axis is a probability density (integrating it gives a probability); if it is read as the score distribution of all examinees, the $y$ axis is probability density × number of examinees (integrating it gives a number of people).
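As a minimal numerical illustration of these two readings of the $y$ axis (the values of $\mu_a$, $\sigma$, and the number of examinees below are hypothetical):

import numpy as np
from scipy.stats import norm

mu_a, sigma, n_examinees = 170.0, 30.0, 550  # hypothetical values

x = np.linspace(0, 320, 3201)  # score grid
dx = x[1] - x[0]
density = norm.pdf(x, mu_a, sigma)

# read as one examinee's score distribution: integrating gives a probability (~1)
print(np.sum(density) * dx)
# read as the whole cohort's score distribution: integrating gives a headcount (~550)
print(np.sum(density * n_examinees) * dx)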
2. In this normal distribution, take the $b$ such that the area of the region $x > b$ equals $\frac{1}{r}$ as the provisional minimum passing score.

Expressed as a formula, this is equivalent to finding the $b$ that satisfies

\int_{-\infty}^{b} \frac{1}{\sqrt{2\pi{\sigma}^2}}\exp{\left(-\frac{(x-\mu_a)^2}{2\sigma^2}\right)} \mathrm{d}x = 1-\frac{1}{r}

The left side is the cumulative distribution function of this normal distribution evaluated at $b$. This equation cannot be solved analytically, so I used Python this time (the code is given later). The line $x = b$ can be pictured as the vertical line that separates passing and failing examinees in the score distribution.
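As a small sketch of this numerical step (with hypothetical values for $\mu_a$, $\sigma$, and $r$), $b$ can be obtained either directly from SciPy's inverse survival function, as in the full code later, or by root-finding on the cumulative distribution function:

from scipy.optimize import brentq
from scipy.stats import norm

mu_a, sigma, r = 170.0, 30.0, 3.0  # hypothetical values

# direct: b with upper-tail area P(X > b) = 1/r
b_direct = norm.isf(1 / r, mu_a, sigma)

# equivalent: root of CDF(b) - (1 - 1/r) = 0
b_root = brentq(lambda b: norm.cdf(b, mu_a, sigma) - (1 - 1 / r),
                mu_a - 10 * sigma, mu_a + 10 * sigma)

print(b_direct, b_root)  # the two agree to numerical precision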
3. Compute the mean of the $x > b$ part of the distribution, and take it as the provisional average score of successful applicants ${\mu_p}^{\prime}$.

This is equivalent to finding the mean (expected value) of a truncated normal distribution:

{\mu_p}^{\prime} = \mu_a + \sigma\frac{\phi\left(\frac{b - \mu_a}{\sigma}\right)}{1-\Phi\left(\frac{b - \mu_a}{\sigma}\right)}

(Here $\phi(\cdot)$ denotes the probability density function of the standard normal distribution, and $\Phi(\cdot)$ its cumulative distribution function.) Reference: https://bellcurve.jp/statistics/blog/18075.html
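As a sanity check on this formula (again with hypothetical values), the closed form can be compared against scipy.stats.truncnorm, which implements the truncated normal distribution directly:

import numpy as np
from scipy.stats import norm, truncnorm

mu_a, sigma, b = 170.0, 30.0, 185.0  # hypothetical values
alpha = (b - mu_a) / sigma  # standardized truncation point

# closed-form mean of the x > b part of the distribution
mu_p_prime = mu_a + sigma * norm.pdf(alpha) / (1 - norm.cdf(alpha))

# the same quantity via the truncated normal (truncated below at b, unbounded above)
mu_p_check = truncnorm.mean(alpha, np.inf, loc=mu_a, scale=sigma)

print(mu_p_prime, mu_p_check)  # should agree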
4. Update the standard deviation $\sigma$ according to whether ${\mu_p}^{\prime}$ is larger or smaller than $\mu_p$, then return to step 2. Repeat.

The larger the initial standard deviation $\sigma$, the larger ${\mu_p}^{\prime}$ becomes (the relationship is monotonically increasing), so the bisection method can be used to search for the $\sigma$ at which ${\mu_p}^{\prime} = \mu_p$. The search interval is halved at each step, so about 100 iterations reach the target value with more than sufficient accuracy. (Note: ${\mu_p}^{\prime}$ is in fact a linear function of $\sigma$, so the desired $\sigma$ can be found directly without bisection; see the sketch below. I only noticed this after the analysis, and since the accuracy and computation time are almost identical, the bisection version is presented as-is.)
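Here is a sketch of that closed-form alternative; it is not the code used for the results below, just a demonstration of the note. Because $b = \mu_a + \sigma z$ with $z$ determined only by $r$, the factor linking $\sigma$ to ${\mu_p}^{\prime} - \mu_a$ is constant and $\sigma$ can be solved for directly:

from scipy.stats import norm

def predict_min_score(mu_a, mu_p, r):
    """Closed-form version of steps 1-5, using the linearity noted above."""
    z = norm.isf(1 / r)                        # standard-normal point with upper-tail area 1/r
    hazard = norm.pdf(z) / (1 - norm.cdf(z))   # constant linking sigma to mu_p' - mu_a
    sigma = (mu_p - mu_a) / hazard             # solve mu_p' = mu_a + sigma * hazard
    return mu_a + sigma * z                    # b = mu_a + sigma * z

# example with the 2013 values from the data below
print(predict_min_score(138.5, 166.5, 433 / 177))  # ~145, vs. an actual 146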
5. When ${\mu_p}^{\prime} = \mu_p$, take the minimum passing score $b$ at that point as the predicted value.

Code and results

The "prediction using the normal distribution" described above was carried out, and its accuracy was compared with the baseline "minimum passing score = (average score of examinees + average score of successful applicants) / 2".

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_absolute_error

# 2013-2019 data from https://www.goro-goro.net/2019-musashi
jukensya = [433, 556, 519, 590, 577, 541, 569]  # number of examinees
goukakusya = [177, 177, 185, 183, 187, 185, 186]  # number of successful applicants
juken_heikin = [138.5, 173.3, 172.9, 167.0, 165.5, 186.9, 170.5]  # examinee mean score
goukaku_heikin = [166.5, 210.7, 210.3, 202.0, 197.0, 221.5, 205.2]  # successful-applicant mean score
goukaku_saitei = [146, 192, 188, 184, 180, 201, 185]  # actual minimum passing score

goukaku_saitei_pred = []

for i in range(7):  # analyze each of the 7 years, 2013-2019
    r = jukensya[i] / goukakusya[i]  # competition ratio
    mu_a = juken_heikin[i]  # examinee mean score
    mu_p = goukaku_heikin[i]  # successful-applicant mean score
    
    sigma_l = 0.1
    sigma_r = 1000
    sigma = (sigma_l + sigma_r) / 2  # standard deviation, searched over the range 0.1 to 1000
    
    for _ in range(100):  # 100 iterations of the bisection method
        b = norm.isf(1 / r, mu_a, sigma)  # provisional minimum passing score
        mu_p_prime = mu_a + sigma * norm.pdf((b - mu_a) / sigma) \
                     / (1 - norm.cdf((b - mu_a) / sigma))  # provisional successful-applicant mean
        if mu_p_prime < mu_p:
            sigma_l = sigma
        else:
            sigma_r = sigma
        sigma = (sigma_l + sigma_r) / 2
    
    goukaku_saitei_pred.append(b)

# Baseline: predict the midpoint of the examinee mean and the successful-applicant mean
goukaku_saitei_pred_rough = [(goukaku_heikin[i] + juken_heikin[i]) / 2 for i in range(7)]

## Check R^2 and MAE ##

print("Pred 1:Results of prediction using normal distribution")
print("R^2: {:.4f}".format(r2_score(goukaku_saitei, goukaku_saitei_pred)))
print("MAE: {:.4f}".format(mean_absolute_error(goukaku_saitei, goukaku_saitei_pred)))
print("")
print("Pred 2: (Candidate average score+Average score of successful applicants) /The result of the prediction that it is 2")
print("R^2: {:.4f}".format(r2_score(goukaku_saitei, goukaku_saitei_pred_rough)))
print("MAE: {:.4f}".format(mean_absolute_error(goukaku_saitei, goukaku_saitei_pred_rough)))

## Plot measured vs. predicted values ##

fig = plt.figure(figsize=(10, 5))
lim = [140, 220]  # range of both axes
s = 17  # font size

# Pred 1: prediction using the normal distribution
ax0 = fig.add_subplot(1,2,1)

ax0.plot(goukaku_saitei, goukaku_saitei_pred, "o", markersize=8)
ax0.plot(lim, lim, "k-")

ax0.set_title('Pred 1', fontsize=s)
ax0.set_xlabel('True', fontsize=s)
ax0.set_ylabel('Predicted', fontsize=s)
ax0.tick_params(labelsize=s)
ax0.set_xlim(lim)
ax0.set_ylim(lim)
ax0.set_aspect('equal')

# Pred 2: prediction that the minimum = (examinee mean + successful-applicant mean) / 2
ax1 = fig.add_subplot(1,2,2)

ax1.plot(goukaku_saitei, goukaku_saitei_pred_rough, "o", markersize=8)
ax1.plot(lim, lim, "k-")

ax1.set_title('Pred 2', fontsize=s)
ax1.set_xlabel('True', fontsize=s)
ax1.set_ylabel('Predicted', fontsize=s)
ax1.tick_params(labelsize=s)
ax1.set_xlim(lim)
ax1.set_ylim(lim)
ax1.set_aspect('equal')

# Save as PNG
fig.tight_layout()
fig.savefig("plot.png")

Output (the R^2 value is the coefficient of determination, and the closer it is to 1, the better the accuracy; MAE is the mean of the absolute prediction errors, and the closer it is to 0, the better):

Pred 1: prediction using the normal distribution
R^2: 0.9892
MAE: 1.4670

Pred 2: prediction that the minimum = (examinee mean + successful-applicant mean) / 2
R^2: 0.9583
MAE: 2.5571

plot.png (measured vs. predicted minimum passing scores for Pred 1 and Pred 2)

Discussion

Pred 1, the prediction using the normal distribution, is more accurate than the simple prediction Pred 2. Its mean prediction error is about 1.5 points, which seems sufficient as a guide for actual study. Also, looking at the plot for Pred 2, the prediction error is large for the data point near 145 in the lower left. This is presumably because Pred 2 takes the midpoint of the examinee mean and the successful-applicant mean regardless of the competition ratio, so it cannot respond to changes in the ratio (in that year only, the ratio was as low as about 2.4, versus about 2.8 to 3.2 in the other years). In Pred 1, by contrast, the prediction error for that point stays at or below the level of the other data, so the method using the normal distribution can be said to cope with changes in the competition ratio.

In Pred 1, many of the data points deviate in the same way: the predicted value is about 1 to 2 points lower than the actual value. Given enough data, accuracy might therefore be improved further by estimating the constant corresponding to this "about 1 to 2 points" and adding it to the predicted value as a correction; a minimal sketch follows.
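The sketch below reuses goukaku_saitei and goukaku_saitei_pred from the code above. Note that with only seven data points the offset is estimated in-sample, so this is an illustration of the idea rather than a validated improvement.

import numpy as np

# assumes goukaku_saitei and goukaku_saitei_pred from the code above
residual = np.array(goukaku_saitei) - np.array(goukaku_saitei_pred)
bias = residual.mean()  # the constant corresponding to "about 1 to 2 points"

# in real use this offset should be estimated on held-out years, not the same data
goukaku_saitei_pred_corrected = np.array(goukaku_saitei_pred) + bias
print(bias)
print(np.abs(goukaku_saitei - goukaku_saitei_pred_corrected).mean())  # corrected MAE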

References

scipy.stats -- API for scipy statistical functions -- keisuke's blog: http://kaisk.hatenadiary.com/entry/2015/02/17/192955

3-5. Skewness and kurtosis | Statistics Time | Statistics WEB: https://bellcurve.jp/statistics/course/17950.html