[PYTHON] [Introduction to Scipy] Calculation of Lorenz curve and Gini coefficient ♬

Last night's comprehensive exercise involved the Lorenz curve and the Gini coefficient. They turned out to be deeper than I expected once I looked into them, so I will summarize what I found.

What I did

・ Lorenz curve
・ Gini coefficient
・ Numerical integration with Scipy
・ Derivation

・ Lorenz curve

What is the Lorenz curve? Taking an income distribution as an example, it is built as follows: divide income into classes; arrange the classes in ascending order of income; line up the number of people belonging to each class alongside them; take the cumulative sum of each; and normalize each cumulative sum so that its maximum is 1. The Lorenz curve is what appears when the normalized cumulative income is plotted on the vertical axis against the normalized cumulative number of people, in that order, on the horizontal axis.

This time I will draw the Lorenz curve for last night's student score distribution. Because the number of students is small, I skip the classification step and simply rank the scores and accumulate them. Using last night's data, the Lorenz curve of the G1 scores can be drawn as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
student_data_math = pd.read_csv('./chap3/student-mat.csv', sep=';')
# Select the rows where sex is 'M'
df0 = student_data_math[student_data_math['sex'].isin(['M'])]
# Sort by 'G1' in ascending order
df = df0.sort_values(by=['G1'])
# Add a rank column 'Ct' running from 1 to n
df['Ct'] = np.arange(1, len(df) + 1)

# x is the rank sequence
x = df['Ct']
# y is the cumulative sum of the G1 scores
y = df['G1'].cumsum()
# Create the figure
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 2*5))
# Normalize x and y by their maxima and draw the curve in blue
ax1.plot(x/max(x), y/max(y), 'blue', label='M')
# Draw y = x, the line of perfect equality
ax1.plot(x/max(x), x/max(x), 'black', label='y = x')
# Draw the frequency distribution on ax2 (grades have 20 levels, so bins=20)
ax2.hist(y/max(y), bins=20, range=(0, 1), label='M')
ax1.set_xlabel('peoples')
ax1.set_ylabel('G1_Grade.cumsum')
ax2.set_ylabel('freq.')
ax2.set_xlabel('G1_Grade.cumsum')
ax1.legend()
ax1.grid(True)
plt.show()

Figure_19-LorenzMhist.png
Figure_19-LorenzFhist.png

Overlaying the curves for men and women gives the following. We can see that there is almost no difference between the two.

Figure_19-LorenzMF.png
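The rank-and-accumulate procedure described above can be wrapped in a small function. This is only a minimal sketch and not the code used for the figures: the name `lorenz_points` and the sample scores are hypothetical, and unlike the script above it prepends the origin (0, 0) so that the curve starts at zero.

```python
import numpy as np

def lorenz_points(values):
    """Return (x, y) points of the Lorenz curve, including the origin."""
    v = np.sort(np.asarray(values, dtype=float))          # ascending order
    n = len(v)
    x = np.arange(0, n + 1) / n                           # cumulative share of people
    y = np.concatenate(([0.0], np.cumsum(v))) / v.sum()   # cumulative share of score
    return x, y

# Hypothetical scores, not taken from student-mat.csv
x, y = lorenz_points([10, 5, 15, 20])
print(x)
print(y)
```

Both arrays run from 0 to 1, so they can be passed directly to `ax1.plot(x, y)`.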

・ Gini coefficient

For reference, the Gini coefficient is defined as twice the area enclosed by the line y = x and the Lorenz curve in the figure above.

【Reference】
・ Gini coefficient @ wikipedia

Below, I quote from the reference.

Economics_Gini_coefficient2.svg.png

「G = A/(A + B). It is also equal to 2A and to 1 − 2B due to the fact that A + B = 0.5 (since the axes scale from 0 to 1).」
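The relation G = 1 − 2B can be tried on a tiny example. The following is a hedged sketch under stated assumptions: `gini_from_lorenz` is a hypothetical helper, the area B under the Lorenz curve is approximated with the trapezoid rule written out by hand, and the curve used is a made-up one for the sorted scores 5, 10, 15, 20.

```python
import numpy as np

def gini_from_lorenz(x, y):
    """Gini coefficient as 1 - 2B, with B the trapezoid area under the Lorenz curve."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Area of each trapezoid between consecutive points, then summed
    area_b = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return 1.0 - 2.0 * area_b

# Hypothetical Lorenz curve for the sorted scores 5, 10, 15, 20
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
g = gini_from_lorenz(x, y)
print(g)  # 0.25 up to floating-point rounding
```

When every score is equal, the Lorenz curve coincides with y = x, B is 0.5, and the function returns 0.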

・ Quadrature with Scipy

Example: how to use integrate.cumtrapz (recent SciPy releases provide the same function under the name integrate.cumulative_trapezoid)

from scipy import integrate
x = np.linspace(0, 2, num=2**4+1)
y = x**4
y_int = integrate.cumtrapz(y, x, initial=0)
plt.plot(x, y_int, 'ro', x, y[0] + 0.2 * x**5, 'b-')
plt.show()

Figure_20ex.png

In the same way, the running integrals needed for the Gini coefficient can be plotted as follows.

df0 = student_data_math[student_data_math['sex'].isin(['M'])]  # use 'F' for women
df = df0.sort_values(by=['G1'])
df['Ct'] = np.arange(1, len(df) + 1)
x = df['Ct']
y1 = df['Ct']
# y2 is the cumulative score
y2 = df['G1'].cumsum()
# y_int1 is the running integral of y = x
y_int1 = integrate.cumtrapz(y1/max(y1), x/max(x), initial=0)
# y_int2 is the running integral of the normalized cumulative G1 score
y_int2 = integrate.cumtrapz(y2/max(y2), x/max(x), initial=0)
# Plot each
plt.plot(x/max(x), 2*y_int1,'black', label = 'y=x')
plt.plot(x/max(x), 2*y_int2, 'blue',label = 'M_A')
plt.plot(x/max(x), 1-2*y_int1,'black',label ='1-2*(y=x)')
plt.plot(x/max(x), 1-2*y_int2,'blue', label ='M_1-2*A')
plt.xlabel('peoples')
plt.ylabel('integrate')
plt.legend()
plt.show()

The gap between the two curves at peoples = 1 is the Gini coefficient.

Figure_19-GiniM.png

Looking at the numbers:

print(1-2*y_int2[len(df)-1]) 
#0.17198115609880305 M 
#0.17238700073127444 F

print(2*y_int1[len(df)-1]) 
#0.999971403242873 

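In other words, the integral of y = x shows an error around the 5th decimal place, while the difference between men and women appears in the 4th decimal place, so it is likely to be real. The value for women seems to be slightly larger. The error arises because x/max(x) starts at 1/n rather than 0: the first strip of the integral is missing, so twice the integral of y = x comes out as 1 − 1/n² instead of 1.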
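The size of that 5th-digit error can be checked directly. The sketch below is an illustration under assumptions, not the article's code: `cumulative_trapezoid_np` is a hypothetical NumPy-only stand-in for what `integrate.cumtrapz(..., initial=0)` computes, and n = 187 is an assumed number of rows, chosen because 1 − 1/187² matches the printed 0.99997140.

```python
import numpy as np

def cumulative_trapezoid_np(y, x):
    """Running trapezoid integral of y over x, starting at 0 (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    steps = (x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0
    return np.concatenate(([0.0], np.cumsum(steps)))

n = 187                          # assumed number of rows
x = np.arange(1, n + 1) / n      # starts at 1/n, as in the script above
twice_area = 2 * cumulative_trapezoid_np(x, x)[-1]
print(twice_area, 1 - 1 / n**2)  # the two values agree

# Prepending 0 to the grid closes the gap
x0 = np.arange(0, n + 1) / n
twice_area_from_zero = 2 * cumulative_trapezoid_np(x0, x0)[-1]
print(twice_area_from_zero)
```

The same missing first strip also affects the integral of the cumulative score, although there the effect is smaller because the lowest scores contribute little.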

Another way to find the Gini coefficient

\begin{align}
GI &= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2\sum_{i=1}^{n}\sum_{j=1}^{n}x_j}\\
&= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n\sum_{i=1}^{n}x_i}\\
&= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n^2\bar x}
\end{align}

Leaving the derivation of the first expression aside, the calculation can be done with the following code.

xbar = df['G1'].mean()   # mean score
s = df['G1'].values      # G1 scores as a NumPy array
n = len(df)              # number of students
GI = 0
# The mean seems to be a little different between men and women
print(xbar)  #10.620192307692308 F  11.229946524064172 M
# Sum |s[i] - s[j]| over all n*n pairs; indices start at 0 so that s[0] is included
for i in range(n):
    a = s[i]
    for j in range(n):
        b = s[j]
        GI += np.abs(a-b)

GI = GI/(2*n*n*xbar)
# Values reported from the original run, whose loops started at index 1:
print(GI)  #0.16938137688477206 F  0.16805449452508275 M

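The reported values show the same tendency as the area-based ones (the value for women is slightly larger), but they are slightly smaller. Most of that gap comes from the run that produced them: its loops started at index 1 (range(1, len(df))), so s[0], the smallest score after sorting, was left out of every pair. Summing over all n elements removes that gap; what remains is the small error from starting the area integration at 1/n instead of 0.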
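The double sum in the formula can also be written with NumPy broadcasting instead of nested loops. This is a minimal sketch under assumptions: `gini_pairwise` is a hypothetical name, the scores are made up, and the n × n difference matrix is only practical for a few thousand values at most.

```python
import numpy as np

def gini_pairwise(values):
    """Gini coefficient from the sum of |x_i - x_j| over all n*n pairs."""
    s = np.asarray(values, dtype=float)
    n = len(s)
    # s[:, None] - s[None, :] is the n x n matrix of all pairwise differences,
    # including the rows and columns for index 0
    diff_sum = np.abs(s[:, None] - s[None, :]).sum()
    return diff_sum / (2 * n * n * s.mean())

scores = [5, 10, 15, 20]   # hypothetical scores
gi = gini_pairwise(scores)
print(gi)  # 0.25 up to floating-point rounding
```

Because the formula only uses absolute differences, the input does not need to be sorted.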

・ Derivation

After all, I think the formula used in this method has to be derived properly. First, find the area of the blue region A in the figure below.

Lorenz-.png

In the figure, the vertical axis is normalized, so x1 and x2 should be read as $x_1 \Rightarrow \frac{x_1}{x_1+x_2}$ and $x_2 \Rightarrow \frac{x_2}{x_1+x_2}$. Then, with the somewhat forced transformation below, you can see that the Gini coefficient formula above holds for this two-point case. It seems the general case can be proved by generalizing this, but I do not have the energy for it now, so I will leave it for another occasion.

\begin{align}
2A &= 2\left(1\cdot 1\cdot\frac{1}{2}-\frac{1}{2}\cdot\frac{x_1}{x_1+x_2}\cdot\frac{1}{2}-\frac{x_1}{x_1+x_2}\cdot\frac{1}{2}-\frac{1}{2}\cdot\frac{x_2}{x_1+x_2}\cdot\frac{1}{2}\right)\\
&=\frac{2(x_1+x_2)}{2(x_1+x_2)}-\frac{x_1}{2(x_1+x_2)}-\frac{2x_1}{2(x_1+x_2)}-\frac{x_2}{2(x_1+x_2)}\\
&=\frac{2(x_1+x_2)-x_1-2x_1-x_2}{2(x_1+x_2)}\\
&=\frac{x_2-x_1}{2(x_1+x_2)}\\
&=\frac{|x_2-x_1|+|x_1-x_2|}{2\cdot 2^2\bar x}
\end{align}
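The two-point result can be checked numerically, and the same comparison can be tried for larger samples. The following is a rough sketch rather than a proof: `gini_area` and `gini_pairwise` are hypothetical helpers, the area version integrates the Lorenz curve from the origin with the trapezoid rule, and the random sample is made up.

```python
import numpy as np

def gini_area(values):
    """Gini coefficient as 1 - 2B, B = trapezoid area under the Lorenz curve from (0, 0)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    x = np.arange(0, n + 1) / n
    y = np.concatenate(([0.0], np.cumsum(v))) / v.sum()
    area_b = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return 1.0 - 2.0 * area_b

def gini_pairwise(values):
    """Gini coefficient from the sum of |x_i - x_j| over all n*n pairs."""
    s = np.asarray(values, dtype=float)
    n = len(s)
    return np.abs(s[:, None] - s[None, :]).sum() / (2 * n * n * s.mean())

# Two-point case: both methods should give (x2 - x1) / (2 (x1 + x2))
x1, x2 = 3.0, 7.0
closed_form = (x2 - x1) / (2 * (x1 + x2))
print(gini_area([x1, x2]), gini_pairwise([x1, x2]), closed_form)

# A larger made-up sample of integer scores between 1 and 20
rng = np.random.default_rng(0)
sample = rng.integers(1, 21, size=50)
print(np.isclose(gini_area(sample), gini_pairwise(sample)))
```

Agreement on a sample does not replace the general derivation, but it suggests the two definitions coincide when the area is integrated from the origin.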

Summary

・ I tried to find the Lorenz curve and the Gini coefficient.
・ I calculated the Gini coefficient of the student grade distribution.
・ The difference in score distribution between men and women is small.

・ I want to prove the general formula of the Gini coefficient.
・ I would like to apply it to investment efficiency (portfolios) such as stocks, exchange rates, and deposits and savings.
