Last night's comprehensive exercise touched on the Lorenz curve and the Gini coefficient. They turned out to be deeper than I expected once I looked into them, so here is a summary of what I found.
・ Lorenz curve
・ Gini coefficient
・ Numerical integration with SciPy
・ Derivation
What is the Lorenz curve?
"To examine an income distribution, for example, group the incomes into classes and sort the classes in ascending order, together with the number of people in each class. Take the cumulative sum of both, and normalize each so that its maximum is 1. Plotting the normalized cumulative number of people on the horizontal axis against the normalized cumulative income on the vertical axis gives the Lorenz curve."
This time, I will draw the Lorenz curve for the score distribution of the students in last night's data. Since the number of students is small, I skip the grouping step and simply sort the scores and accumulate them. The G1 score distribution can then be drawn as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the data
student_data_math = pd.read_csv('./chap3/student-mat.csv', sep=';')
# Select the rows where sex is 'M' (copy so a column can be added safely)
df0 = student_data_math[student_data_math['sex'].isin(['M'])].copy()
# Sort by 'G1' in ascending order
df = df0.sort_values(by=['G1'])
# Add a rank column 'Ct' running from 1 to n
df['Ct'] = np.arange(1, len(df)+1)
# x is the rank
x = df['Ct']
# y is the cumulative sum of the sorted G1 scores
y = df['G1'].cumsum()
# Set up the figure
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 2*5))
# Normalize x and y by their maxima and draw the Lorenz curve in blue
ax1.plot(x/max(x), y/max(y), 'blue', label='M')
# Draw y = x, the curve for a perfectly uniform distribution
ax1.plot(x/max(x), x/max(x), 'black', label='y = x')
# Draw the frequency distribution on ax2 (the grade has 20 levels, so bins=20)
ax2.hist(y/max(y), bins=20, range=(0, 1), label='M')
ax1.set_xlabel('people')
ax1.set_ylabel('G1_Grade.cumsum')
ax2.set_ylabel('freq.')
ax2.set_xlabel('G1_Grade.cumsum')
ax1.legend()
ax1.grid(True)
plt.show()
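As a compact restatement of the steps above, here is a minimal sketch that builds the Lorenz curve points from a plain list of scores using only the standard library. The function name `lorenz_points` and the sample scores are made up for illustration (they are not taken from student-mat.csv), and the sketch prepends the origin (0, 0), which the plotting code above does not do.

```python
def lorenz_points(values):
    """Return (xs, ys) for the Lorenz curve of `values`, starting at (0, 0)."""
    ordered = sorted(values)          # ascending order
    total = sum(ordered)              # normalizer for the vertical axis
    n = len(ordered)                  # normalizer for the horizontal axis
    xs, ys = [0.0], [0.0]
    running = 0
    for rank, v in enumerate(ordered, start=1):
        running += v                  # cumulative sum of the sorted values
        xs.append(rank / n)
        ys.append(running / total)
    return xs, ys


# Hypothetical sample scores, for illustration only
xs, ys = lorenz_points([4, 1, 3, 2])
print(xs)
print(ys)
```

Both lists end at 1, and the curve lies on or below y = x because the values are accumulated from the smallest upward.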
Repeating the same steps for the female students ('F') gave the following. The two curves are almost identical, so there is almost no difference between the male and female students.
For reference, the Gini coefficient is defined as twice the area enclosed between the line y = x and the Lorenz curve in the figure above.
[Reference]
・ Gini coefficient @ wikipedia
Quoting from the reference, where A is the area between the line y = x and the Lorenz curve and B is the area under the Lorenz curve:
"G = A/(A + B). It is also equal to 2A and to 1 − 2B due to the fact that A + B = 0.5 (since the axes scale from 0 to 1)."
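The area can be obtained by numerical integration with SciPy.
Example: how to use integrate.cumtrapz
from scipy import integrate
# Note: newer SciPy versions provide this function as
# integrate.cumulative_trapezoid (same arguments); cumtrapz is the older name.
x = np.linspace(0, 2, num=2**4+1)
y = x**4
# Cumulative trapezoidal integral of y over x, starting from 0
y_int = integrate.cumtrapz(y, x, initial=0)
# Red dots: numerical integral; blue line: exact integral x**5 / 5
plt.plot(x, y_int, 'ro', x, y[0] + 0.2 * x**5, 'b-')
plt.show()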
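To make clear what the cumulative trapezoidal rule actually computes, here is a minimal sketch of the same calculation in plain Python. The helper name `cumtrapz_sketch` is hypothetical and this is not SciPy's implementation; it only mirrors the behavior of the call above with `initial=0`.

```python
def cumtrapz_sketch(y, x):
    """Running trapezoidal integral of y over x; the first entry is 0."""
    out = [0.0]
    for k in range(1, len(x)):
        width = x[k] - x[k - 1]
        mean_height = (y[k] + y[k - 1]) / 2
        out.append(out[-1] + width * mean_height)  # add one trapezoid
    return out


# Same grid and integrand as in the SciPy example: y = x**4 on [0, 2]
x = [2 * k / 16 for k in range(17)]
y = [v ** 4 for v in x]
y_int = cumtrapz_sketch(y, x)
print(y_int[-1], 0.2 * x[-1] ** 5)  # numerical vs exact integral at x = 2
```

Because x**4 is convex, each trapezoid lies slightly above the curve, so the numerical value comes out a little larger than the exact one.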
In the same way, the integrals needed for the Gini coefficient can be computed and plotted as follows.
# Select the male students (replace 'M' with 'F' for the female students)
df0 = student_data_math[student_data_math['sex'].isin(['M'])].copy()
df = df0.sort_values(by=['G1'])
df['Ct'] = np.arange(1, len(df)+1)
x = df['Ct']
# y1 is the rank itself, which gives the line y = x after normalization
y1 = df['Ct']
# y2 is the cumulative score
y2 = df['G1'].cumsum()
# y_int1 is the integral of y = x
y_int1 = integrate.cumtrapz(y1/max(y1), x/max(x), initial=0)
# y_int2 is the integral of the normalized cumulative G1 score (the Lorenz curve)
y_int2 = integrate.cumtrapz(y2/max(y2), x/max(x), initial=0)
# Plot twice each integral, and 1 minus twice each integral
plt.plot(x/max(x), 2*y_int1, 'black', label='y=x')
plt.plot(x/max(x), 2*y_int2, 'blue', label='M_A')
plt.plot(x/max(x), 1-2*y_int1, 'black', label='1-2*(y=x)')
plt.plot(x/max(x), 1-2*y_int2, 'blue', label='M_1-2*A')
plt.xlabel('people')
plt.ylabel('integrate')
plt.legend()
plt.show()
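The difference between the two curves at the right edge (people = 1) is the Gini coefficient. Looking at the numbers:
print(1-2*y_int2[len(df)-1])
#0.17198115609880305 M
#0.17238700073127444 F
print(2*y_int1[len(df)-1])
#0.999971403242873
The second value should be exactly 1. It falls short by about 3e-5 because x/max(x) starts at 1/n rather than 0, so the strip from 0 to 1/n is left out of the integral. In other words, the area-based values carry an error around the fifth decimal place, while the difference between the male and female students appears in the fourth. The value for the female students is slightly larger.
The Gini coefficient is also given by the following formula.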
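The effect of where the integration starts can be checked on a tiny example. The following is a minimal sketch in plain Python, not the code used for the numbers above: the function name `gini_from_area`, its `include_origin` flag, and the sample scores are all assumptions made for illustration.

```python
def gini_from_area(values, include_origin=True):
    """Gini coefficient as 1 - 2B, with B the trapezoidal area under the Lorenz curve."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    xs, ys, running = [], [], 0
    for rank, v in enumerate(ordered, start=1):
        running += v
        xs.append(rank / n)          # normalized rank, starts at 1/n
        ys.append(running / total)   # normalized cumulative score
    if include_origin:
        # Start the curve at (0, 0) so the strip [0, 1/n] is integrated too
        xs, ys = [0.0] + xs, [0.0] + ys
    area_b = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2
                 for k in range(1, len(xs)))
    return 1 - 2 * area_b


scores = [1, 2, 3, 4]  # hypothetical sample, for illustration only
print(gini_from_area(scores))                        # curve starts at (0, 0)
print(gini_from_area(scores, include_origin=False))  # curve starts at x = 1/n
```

For this sample the two results differ by x_min / (n * sum of scores) = 1/40, which is the doubled area of the small triangle that is skipped when the curve starts at 1/n. With a few hundred students the same term becomes very small.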
\begin{align}
GI &= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2\sum_{i=1}^{n}\sum_{j=1}^{n}x_j}\\
&= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n\sum_{i=1}^{n}x_i}\\
&= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n^2\bar x}
\end{align}
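Leaving the derivation of the first equation aside, the calculation can be done with the following code.
xbar = df['G1'].mean()  # mean score
s = df['G1'].values  # G1 values only, as a NumPy array
n = len(df)  # number of students
GI = 0
# The mean differs slightly between the male and female students
print(xbar) #10.620192307692308 F 11.229946524064172 M
# Sum |x_i - x_j| over all pairs; the indices must start at 0
for i in range(n):
    a = s[i]
    for j in range(n):
        b = s[j]
        GI += np.abs(a-b)
GI = GI/(2*n*n*xbar)
print(GI)
# Values recorded from my original run, whose loops started at index 1:
# 0.16938137688477206 F 0.16805449452508275 M
The recorded values show the same tendency as the area-based ones but are slightly smaller. The loops in that run started at index 1 and so skipped the smallest score. The skipped terms add up to (1 - x_min/xbar)/n, which is at most 1/n, and the gap to the area-based values (about 0.003 to 0.004) is consistent with that, so the skipped score likely accounts for most of the difference.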
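For checking such a calculation on small inputs, here is a minimal sketch in plain Python that evaluates the double-sum formula over all pairs and compares it with an equivalent form based on sorted ranks, G = 2 Σ i·x_(i) / (n Σ x) − (n + 1)/n. The function names and the sample scores are hypothetical, and the rank-based form is a standard rewriting that I am swapping in for comparison, not something taken from the data above.

```python
def gini_pairwise(values):
    """Gini coefficient from the double sum over all pairs (i, j)."""
    n = len(values)
    mean = sum(values) / n
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean)


def gini_sorted(values):
    """Equivalent rank-based form; needs only one pass over the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2 * weighted / (n * sum(ordered)) - (n + 1) / n


scores = [4, 1, 3, 2]  # hypothetical sample, for illustration only
print(gini_pairwise(scores))
print(gini_sorted(scores))
```

The two functions should agree up to rounding. The pairwise version costs O(n^2) operations, while the sorted version costs O(n log n), which matters only for much larger data sets than this one.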
After all, I think this formula has to be derived properly. First, consider the case of just two data points with $x_1 \le x_2$ and find the area of the blue part A in the figure below. Since the vertical axis is normalized, $x_1$ and $x_2$ should be read as $x_1 \Rightarrow \frac{x_1}{x_1+x_2}$, $x_2 \Rightarrow \frac{x_2}{x_1+x_2}$. Rewriting 2A as follows shows that the formula for the Gini coefficient above holds for these two points. Generalizing this should give a proof for n points, but I will leave that for another occasion.
\begin{align}
2A &= 2\left(1\cdot 1\cdot\frac{1}{2} - \frac{1}{2}\cdot\frac{x_1}{x_1+x_2}\cdot\frac{1}{2} - \frac{x_1}{x_1+x_2}\cdot\frac{1}{2} - \frac{1}{2}\cdot\frac{x_2}{x_1+x_2}\cdot\frac{1}{2}\right)\\
&= \frac{2(x_1+x_2)}{2(x_1+x_2)} - \frac{x_1}{2(x_1+x_2)} - \frac{2x_1}{2(x_1+x_2)} - \frac{x_2}{2(x_1+x_2)}\\
&= \frac{2(x_1+x_2)-x_1-2x_1-x_2}{2(x_1+x_2)}\\
&= \frac{x_2-x_1}{2(x_1+x_2)}\\
&= \frac{|x_2-x_1|+|x_1-x_2|}{2\cdot 2^2\bar x}
\end{align}
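The two-point result can be checked numerically. The following is a minimal sketch with hypothetical function names and made-up sample values: it computes 2A geometrically from the three-point Lorenz curve through (0, 0), (1/2, x1/(x1+x2)), (1, 1) and compares it with the closed form (x2 − x1) / (2(x1 + x2)).

```python
def two_point_gini_area(x1, x2):
    """2A for two values x1 <= x2, from the area under the Lorenz curve."""
    share1 = x1 / (x1 + x2)  # normalized height of the curve at x = 1/2
    # Two trapezoids of width 1/2: (0,0)-(1/2,share1) and (1/2,share1)-(1,1)
    area_b = 0.5 * (0 + share1) / 2 + 0.5 * (share1 + 1) / 2
    return 2 * (0.5 - area_b)


def two_point_gini_formula(x1, x2):
    """Closed form obtained at the end of the derivation."""
    return (x2 - x1) / (2 * (x1 + x2))


# Hypothetical sample values, for illustration only
print(two_point_gini_area(1, 3), two_point_gini_formula(1, 3))
```

Equal values give 0 in both functions, as expected for a perfectly even distribution.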
・ I computed the Lorenz curve and the Gini coefficient.
・ I calculated the Gini coefficient that reflects the students' grade distribution.
・ The difference in score distribution between the male and female students is small.
・ I want to prove the general formula for the Gini coefficient.
・ I would like to apply it to investment efficiency (portfolios) such as stocks, exchange rates, and deposits and savings.