[PYTHON] [Statistics for programmers] Lorenz curve and Gini coefficient

Table of contents

Statistics for Programmers: Table of Contents

Overview

This article assumes that you have already read the following article: [Statistics for programmers] Frequency distribution and histogram

What is the Lorenz curve?

The Lorenz curve is a graph used to check how unevenly something such as income or population is distributed.

As an example, we will investigate how unevenly the population is distributed among the villages of a region. The total number of villages is 100. The population of each village is as follows. The list is long and hard to read, but it gives the population of every village.

population = [
124, 151, 102, 189, 160, 145, 120, 132, 135, 159, 114, 175, 171, 124, 154,
177, 152, 120, 144, 121, 113, 163, 186, 196, 183, 105, 130, 149, 130, 123, 
175, 143, 186, 182, 184, 174, 134, 158, 196, 109,
216, 285, 209, 288, 276, 281, 283, 200, 262, 
267, 235, 206, 245, 232, 299, 249, 295, 232, 206, 237,
369,
487, 450, 487,
597, 548,
682,
712, 706, 755, 700, 709, 747, 773, 796, 739, 716, 756, 767, 752, 728, 750,
829, 875, 845, 881, 865, 804, 845, 890, 872, 833, 874, 845, 859, 837, 847, 811, 893, 807
]

How to construct the Lorenz curve

The Lorenz curve needs two kinds of cumulative relative frequency, one for the X-axis and one for the Y-axis.

  1. Cumulative relative frequency of the frequency (number of villages) of each class
  2. Cumulative relative frequency of the total of the values (number of people) belonging to each class

Each is explained below.

1. Cumulative relative frequency of the frequency (number of villages) of each class

The first is the cumulative relative frequency where the classes are population ranges and the frequency is the number of villages in each range.

It is calculated in the same way as described in this article: [Statistics for programmers] Frequency distribution and histogram

| Class (population) | Class value | Frequency (villages) | Relative frequency | Cumulative relative frequency |
|---|---|---|---|---|
| 100 or more, less than 200 | 150 | 40 | 0.40 | 0.40 |
| 200 or more, less than 300 | 250 | 20 | 0.20 | 0.60 |
| 300 or more, less than 400 | 350 | 1 | 0.01 | 0.61 |
| 400 or more, less than 500 | 450 | 3 | 0.03 | 0.64 |
| 500 or more, less than 600 | 550 | 2 | 0.02 | 0.66 |
| 600 or more, less than 700 | 650 | 1 | 0.01 | 0.67 |
| 700 or more, less than 800 | 750 | 15 | 0.15 | 0.82 |
| 800 or more, less than 900 | 850 | 18 | 0.18 | 1.00 |
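As a rough sketch of how the last two columns are derived, the snippet below starts from the village counts in the table and accumulates them with NumPy. The variable names are illustrative and do not come from the original article.

```python
import numpy as np

# Number of villages in each class, copied from the frequency column above
frequency = np.array([40, 20, 1, 3, 2, 1, 15, 18])

# Relative frequency: each class's share of all 100 villages
relative = frequency / frequency.sum()

# Cumulative relative frequency: running total of those shares
cumulative = np.cumsum(relative)

print(np.round(relative, 2))
print(np.round(cumulative, 2))
```

The rounded cumulative values should agree with the last column of the table, ending at 1.00.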

2. Cumulative relative frequency of the total of the values (number of people) belonging to each class

The second is the cumulative relative frequency where the classes are again population ranges, but the frequency is the total population of the villages in each class.

To get the total population of each class, add up the matching values in the population list above. For example, the total for "400 or more, less than 500" is 1424, as shown below.

1424 = 487 + 450 + 487

The total population of all villages is 41029 (the sum of all values in population), so the relative frequency for "400 or more, less than 500" is about 0.03 (rounded to two decimal places).

0.03 \approx \frac{1424}{41029}

Repeating this for every class and accumulating the results gives the cumulative relative frequency of the number of people.
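The calculation for a single class can be checked with a few lines of plain Python. This is only a small sketch; the variable names are made up for illustration, and the total of 41029 is taken from the text rather than recomputed.

```python
# Populations of the three villages in the "400 or more, less than 500" class
class_villages = [487, 450, 487]

# Total population of all 100 villages, as stated in the text
total_population = 41029

class_total = sum(class_villages)            # total number of people in this class
relative = class_total / total_population    # this class's share of all people

print(class_total, round(relative, 2))
```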

Frequency distribution table

The following frequency distribution table combines the first and second calculations above. (Relative frequencies and cumulative relative frequencies are rounded to two decimal places.)

| Class (population) | Class value | Frequency (villages) | Relative frequency | Cumulative relative frequency of frequency | Total number of people | Relative frequency of the number of people | Cumulative relative frequency of the number of people |
|---|---|---|---|---|---|---|---|
| 100 or more, less than 200 | 150 | 40 | 0.40 | 0.40 | 5988 | 0.15 | 0.15 |
| 200 or more, less than 300 | 250 | 20 | 0.20 | 0.60 | 5003 | 0.12 | 0.27 |
| 300 or more, less than 400 | 350 | 1 | 0.01 | 0.61 | 369 | 0.01 | 0.28 |
| 400 or more, less than 500 | 450 | 3 | 0.03 | 0.64 | 1424 | 0.03 | 0.31 |
| 500 or more, less than 600 | 550 | 2 | 0.02 | 0.66 | 1145 | 0.03 | 0.34 |
| 600 or more, less than 700 | 650 | 1 | 0.01 | 0.67 | 682 | 0.02 | 0.36 |
| 700 or more, less than 800 | 750 | 15 | 0.15 | 0.82 | 11106 | 0.27 | 0.63 |
| 800 or more, less than 900 | 850 | 18 | 0.18 | 1.00 | 15312 | 0.37 | 1.00 |

The Lorenz curve is a graph with the "cumulative relative frequency of frequency" on the x-axis and the "cumulative relative frequency of the number of people" on the y-axis.
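The whole table can also be derived directly from the population list. The sketch below is one possible way to do it, assuming NumPy's `np.histogram` with class boundaries at 100, 200, ..., 900. The original article does not say how its table was produced, so treat this as an illustration rather than the author's method.

```python
import numpy as np

# Village populations, copied from the list at the top of this article
population = np.array([
    124, 151, 102, 189, 160, 145, 120, 132, 135, 159, 114, 175, 171, 124, 154,
    177, 152, 120, 144, 121, 113, 163, 186, 196, 183, 105, 130, 149, 130, 123,
    175, 143, 186, 182, 184, 174, 134, 158, 196, 109,
    216, 285, 209, 288, 276, 281, 283, 200, 262,
    267, 235, 206, 245, 232, 299, 249, 295, 232, 206, 237,
    369,
    487, 450, 487,
    597, 548,
    682,
    712, 706, 755, 700, 709, 747, 773, 796, 739, 716, 756, 767, 752, 728, 750,
    829, 875, 845, 881, 865, 804, 845, 890, 872, 833, 874, 845, 859, 837, 847,
    811, 893, 807,
])

# Class boundaries 100, 200, ..., 900; each class includes its lower boundary
edges = np.arange(100, 1000, 100)

# Number of villages per class
frequency, _ = np.histogram(population, bins=edges)
# Total population per class: same classes, weighted by each village's population
people, _ = np.histogram(population, bins=edges, weights=population)

# Cumulative relative frequencies used for the X-axis and Y-axis
cum_villages = np.cumsum(frequency) / frequency.sum()
cum_people = np.cumsum(people) / people.sum()

for lower, f, p, cv, cp in zip(edges[:-1], frequency, people, cum_villages, cum_people):
    print(f"{lower}-{lower + 100}: {int(f):3d} villages, {int(p):6d} people, "
          f"cumulative {cv:.2f} / {cp:.2f}")
```

Prepending 0 to `cum_villages` and `cum_people` gives the points of the Lorenz curve.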

Draw a Lorenz curve

Graph the above data using matplotlib.

Reference: Draw graph with jupyter (ipython notebook) + matplotlib + vagrant

# Jupyter notebook magic that shows plots inline; omit it when running as a plain script
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# X coordinates: cumulative relative frequency of the number of villages
left = np.array([
  0,
  0.40,
  0.60,
  0.61,
  0.64,
  0.66,
  0.67,
  0.82,
  1.00,
])
# Y coordinates: cumulative relative frequency of the number of people
height = np.array([
  0,
  0.15,
  0.27,
  0.28,
  0.31,
  0.34,
  0.36,
  0.63,
  1.00
])
# Lorenz curve
ax.plot(left, height, marker='o')
# Graph title
plt.title('Lorenz Curve')
# X-axis title
plt.xlabel('villages')
# Y-axis title
plt.ylabel('population')

# Perfect equality line (dashed diagonal from (0, 0) to (1, 1))
left = np.array([0, 0.2, 0.4, 0.6, 0.8, 1.0])
height = np.array([0, 0.2, 0.4, 0.6, 0.8, 1.0])
ax.plot(left, height, linestyle='dashed', color='black')

# Tick values shared by both axes
label = [0, 0.2, 0.4, 0.6, 0.8, 1.0]

# Tick positions on the X axis
ax.set_xticks(label)
# Tick labels on the X axis
ax.set_xticklabels(label)

# Tick positions on the Y axis
ax.set_yticks(label)
# Tick labels on the Y axis
ax.set_yticklabels(label)

# Draw the graph
plt.show()

[Figure: the Lorenz curve plotted by the code above, together with the dashed diagonal line]

This blue curve is the Lorenz curve.

What you can see by looking at the Lorenz curve

I will explain how to look at the Lorenz curve created above.

[Figure: the Lorenz curve above with a red line marking one point on the curve]

Let's look at the coordinates of the point marked with the red line. It shows that about 65% of the villages account for only about 36% of the total population. In other words, the remaining 64% of the population is concentrated in the other 35% or so of the villages. The further the curve bulges away from the diagonal, the greater the bias.
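Reading a point off the curve can also be done numerically. The sketch below assumes linear interpolation between the table's points with `np.interp`, which mirrors the straight segments in the plot. The helper name `population_share` is hypothetical, and the 0.67 used here is the table's value for villages with fewer than 700 people, which is close to but not necessarily the same as the point marked in the figure.

```python
import numpy as np

# Lorenz-curve points taken from the frequency distribution table
villages = [0, 0.40, 0.60, 0.61, 0.64, 0.66, 0.67, 0.82, 1.00]
people = [0, 0.15, 0.27, 0.28, 0.31, 0.34, 0.36, 0.63, 1.00]

def population_share(village_share):
    """Share of the population held by the smallest `village_share` of villages."""
    # Linear interpolation between the plotted points
    return float(np.interp(village_share, villages, people))

# Villages with fewer than 700 people make up 67% of all villages
held = population_share(0.67)
print(f"{held:.0%} of the population lives in the smallest 67% of villages")
print(f"{1 - held:.0%} lives in the remaining 33%")
```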

Gini coefficient

The Gini coefficient expresses this bias as a single number. To calculate it, first note that the dashed line in the graph above is called the "perfect equality line". The Gini coefficient is twice the area between this perfect equality line and the Lorenz curve.

The Gini coefficient takes a value from 0 to 1: the closer it is to 1, the greater the bias, and the closer it is to 0, the smaller the bias.

When the Gini coefficient is 0, the Lorenz curve coincides with the perfect equality line.
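The definition above translates into a short function. The following is a minimal sketch, assuming the Lorenz curve is given as straight segments between known points, so that the area under it is a sum of trapezoids. The name `gini_from_lorenz` is a hypothetical helper, not something from the original article or from a library.

```python
import numpy as np

def gini_from_lorenz(x, y):
    """Gini coefficient from Lorenz-curve points (x, y), both running from 0 to 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Area under the Lorenz curve: one trapezoid per pair of consecutive points
    area_under_curve = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
    # The area between the perfect equality line and the curve is
    # 0.5 - area_under_curve; the Gini coefficient is twice that area
    return float(2 * (0.5 - area_under_curve))

# A curve lying exactly on the perfect equality line
print(gini_from_lorenz([0, 0.5, 1], [0, 0.5, 1]))

# The village example, using the rounded values from the frequency distribution table
villages = [0, 0.40, 0.60, 0.61, 0.64, 0.66, 0.67, 0.82, 1.00]
people = [0, 0.15, 0.27, 0.28, 0.31, 0.34, 0.36, 0.63, 1.00]
print(round(gini_from_lorenz(villages, people), 2))
```

For the village data this comes out at roughly 0.37. Because it is computed from grouped, rounded values, it is only an approximation of the Gini coefficient of the raw population list.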

Calculate the Gini coefficient

Now let's actually calculate the Gini coefficient. To keep the arithmetic simple, the following values are used instead of the example above.

X axis
0, 0.4, 0.8, 1

Y axis
0, 0.2, 0.6, 1.0

These values produce the following Lorenz curve.

[Figure: Lorenz curve through the points (0, 0), (0.4, 0.2), (0.8, 0.6) and (1, 1), with the dashed perfect equality line]

To obtain the Gini coefficient, divide the graph as shown below, subtract the areas of parts 1, 2 and 3 in the figure from the area of the triangle whose hypotenuse is the perfect equality line, and double the result.

[Figure: the same graph with the region under the Lorenz curve divided into parts 1, 2 and 3]

First, find the area of the triangle whose hypotenuse is the perfect equality line.

0.5 = 1\times1\div2

Find the area of part 1. Since it is a right triangle with a base of 0.4 and a height of 0.2, it can be calculated as follows.

0.04 = 0.4\times0.2\div2

Find the area of part 2. When rotated to the right, it becomes a trapezoid with an upper base of 0.2, a lower base of 0.6, and a height of 0.4, which can be calculated as follows.

0.16 = (0.2+0.6)\times0.4\div2

Find the area of part 3. Rotated to the right, it becomes a trapezoid with an upper base of 0.6, a lower base of 1, and a height of 0.2, so it can be calculated as follows.

0.16 = (0.6+1)\times0.2\div2

Subtracting these three areas from the area of the triangle whose hypotenuse is the perfect equality line and doubling the result gives the Gini coefficient, which is 0.28 as shown in the formula below.

0.28 = (0.5 - 0.04 - 0.16 - 0.16)\times2
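The hand calculation above can be double-checked with a short script. This sketch simply repeats the same steps in Python; the variable names are illustrative.

```python
# Lorenz-curve points from the simplified example
x = [0, 0.4, 0.8, 1.0]
y = [0, 0.2, 0.6, 1.0]

# Triangle whose hypotenuse is the perfect equality line
triangle = 1 * 1 / 2

# Parts 1 to 3: the area under each segment of the Lorenz curve.
# Part 1 is a triangle, which the trapezoid formula handles with an upper base of 0.
parts = [
    (y[i] + y[i + 1]) * (x[i + 1] - x[i]) / 2
    for i in range(len(x) - 1)
]

gini = (triangle - sum(parts)) * 2
print([round(p, 2) for p in parts], round(gini, 2))
```

The three part areas round to 0.04, 0.16 and 0.16, and the result rounds to 0.28, matching the values worked out by hand.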

That's all.

