[PYTHON] Create an age group with pandas

This post rolls population counts recorded for each single year of age up into 10-year age groups. The bins are defined with pandas cut and then aggregated with groupby.

The data is a CSV export of the "Population by Age" Excel file published by the Statistics Bureau of the Ministry of Internal Affairs and Communications.

For ease of use, delete the description lines at the top of the data, the notes at the bottom, and the "100+" and "Unknown" rows. The adjusted file is population-by-age.csv.
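
The cleanup above could also be scripted instead of done by hand. The following is only a sketch: the raw layout (two description lines at the top, one note line at the bottom) and all the numbers in it are made up for illustration and do not come from the real file.

```python
import io
import pandas as pd

# Made-up miniature of a raw export. The layout (two description lines on top,
# one note line at the bottom) is an assumption, not the real file's layout.
raw = """Population by Age (hypothetical sample)
Unit: persons
age,y1920,y1930
0,10,12
1,11,13
2,12,14
100+,1,2
Unknown,0,3
Note: figures are illustrative only
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,        # drop the description lines at the top
    skipfooter=1,      # drop the note at the bottom
    engine="python",   # skipfooter is only supported by the python engine
    index_col="age",
)

# Drop the rows that are not a single year of age, then make the index numeric.
df = df.drop(index=["100+", "Unknown"])
df.index = df.index.astype(int)

print(df)
```

For a real file, the values of skiprows and skipfooter have to be matched to the actual number of description and note lines.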

Processing with pandas

First, import the numpy and pandas modules. I also set a plot style for drawing graphs in IPython.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# pd.options.display.mpl_style was removed from pandas; set a matplotlib style instead
plt.style.use('ggplot')

Read the CSV file, using the age column (the first column) as the index. After reading, check the data types.

df = pd.read_csv('population-by-age.csv', index_col='age')
print(df.dtypes)
y1920    int64
y1930    int64
y1940    int64
y1950    int64
y1960    int64
y1970    int64
y1980    int64
y1990    int64
y2000    int64
y2010    int64
dtype: object

Also check the first rows, the last rows, and the summary statistics. The output is omitted here.

print(df.head(3))
print(df.tail(3))
print(df.describe())

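Use cut to define the bins. To change the bin width, adjust the third argument (the step) of both range and np.arange, and the i + 9 used in the labels. Whether a bin includes its endpoints is controlled by options: switch *include_lowest* and *right* as needed. With right=False, as here, each bin includes its left edge and excludes its right edge, so age 10 falls in the "10 - 19" bin.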

labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
c = pd.cut(df.index, np.arange(0, 101, 10),
           include_lowest=True, right=False,
           labels=labels)

print(df.groupby(c).sum())
            y1920     y1930     y1940     y1950     y1960     y1970     y1980
0 - 9    14314635  16778220  17961607  20728122  17049068  16965066  18547450   
10 - 19  11520624  13340649  15816378  17267585  20326076  16921989  17231873   
20 - 29   8533259  10367140  11756837  13910662  16527810  19749434  16882381   
30 - 39   7020188   7798498   9370143  10250310  13555835  16578939  19973312   
40 - 49   5902331   6332741   7041270   8487529   9835689  13217564  16427887   
50 - 59   4074855   5046797   5446760   6137697   7842597   9230197  12813527   
60 - 69   2968342   2977915   3782574   4074610   5092019   6709761   8429928   
70 - 79   1378630   1478319   1541314   1967261   2518482   3401952   5059662   
80 - 89    236419    315624    338472    354836    638738    879221   1503633   
90 - 99     13657     13997     18567     16258     32043     65629    118391   

            y1990     y2000     y2010  
0 - 9    13959454  11925887  10882409  
10 - 19  18533872  14034777  11984392  
20 - 29  16870834  18211769  13720134  
30 - 39  16791465  16891475  18127846  
40 - 49  19676302  16716227  16774981  
50 - 59  15813274  19176162  16308233  
60 - 69  11848590  14841772  18247422  
70 - 79   6835747  10051176  12904315  
80 - 89   2665908   4147012   6768852  
90 - 99    286141    688769   1318463  

This gives the single-year-of-age figures aggregated into 10-year age groups.
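
To see what the *right* and *include_lowest* options do at the bin boundaries, here is a small sketch on a handful of illustrative ages (the ages and the three-bin edges are chosen for the example and are not taken from the data set).

```python
import numpy as np
import pandas as pd

ages = pd.Index([0, 9, 10, 19, 20], name="age")  # illustrative ages on bin boundaries
edges = np.arange(0, 31, 10)                     # bin edges 0, 10, 20, 30

# right=False: bins are [0, 10), [10, 20), [20, 30); age 10 belongs to the second bin.
left_closed = pd.cut(ages, edges, right=False)

# right=True with include_lowest=True: bins are [0, 10], (10, 20], (20, 30];
# age 10 now belongs to the first bin, and age 0 is still included.
right_closed = pd.cut(ages, edges, right=True, include_lowest=True)

print(pd.DataFrame({"right=False": left_closed, "right=True": right_closed}, index=ages))
```

For age groups such as "0 - 9" and "10 - 19", right=False is the natural choice, because each decade starts at its lower edge.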

Aggregate functions other than sum can be used, and several can be applied at once. Check the result of the following.

print(df.groupby(c).agg(['count', 'min', 'max', 'mean', 'std']))
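
Since the result of agg with several functions is not shown, here is a sketch of its shape on a toy frame. The frame, its values, and the two-bin split are made up for illustration; observed=False is passed only to make the grouping behavior for categorical keys explicit.

```python
import numpy as np
import pandas as pd

# Toy frame: ages 0-19 with made-up counts (not the census data).
toy = pd.DataFrame(
    {"y1920": np.arange(20), "y1930": np.arange(20) * 2},
    index=pd.Index(np.arange(20), name="age"),
)
groups = pd.cut(toy.index, [0, 10, 20], right=False, labels=["0 - 9", "10 - 19"])

summary = toy.groupby(groups, observed=False).agg(["count", "min", "max", "mean"])

# The columns form a two-level index: (original column, aggregate function).
print(summary["y1920"])           # all aggregates for one year
print(summary[("y1930", "max")])  # one aggregate for one year
```

Selecting by the first level returns all aggregates for that year, while a (year, function) tuple picks out a single column.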

Graph drawing

The numbers alone make the relationships hard to see, so draw graphs to get an overview. Place two plots side by side to compare the effect of the *stacked* option.

fig, axes = plt.subplots(ncols=2)
df.groupby(c).sum().plot(kind='bar', ax=axes[0])
df.groupby(c).sum().T.plot(kind='bar', stacked=True, ax=axes[1])

population-10year-bar.png

Looking at the 10-year age groups, the population aged 60 and over has grown in recent decades, while the young population has been shrinking. The stacked graph shows that the total population rose steadily from 1920 to 2000 and then almost levelled off through 2010. As for the age distribution, the upper part of each bar (the older groups) takes up a growing share.
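
The shift in age distribution can also be checked numerically. The sketch below re-enters the y1920 and y2010 columns from the aggregated table above and computes each age group's share of the total; the variable names are chosen for this example.

```python
import pandas as pd

labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# y1920 and y2010 columns copied from the aggregated table above.
grouped = pd.DataFrame(
    {
        "y1920": [14314635, 11520624, 8533259, 7020188, 5902331,
                  4074855, 2968342, 1378630, 236419, 13657],
        "y2010": [10882409, 11984392, 13720134, 18127846, 16774981,
                  16308233, 18247422, 12904315, 6768852, 1318463],
    },
    index=labels,
)

# Divide each column by its total to get the share of every age group.
share = grouped / grouped.sum()

# Share of the population aged 60 and over (the last four age groups).
share_60_plus = share.loc[labels[6:]].sum()
print(share_60_plus.round(3))
```

On these figures the share aged 60 and over goes from roughly 8% in 1920 to roughly 31% in 2010.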

Now that we've seen the overall trends, draw each series of the original data frame. Plotting them all on one chart would be cluttered, so draw a separate graph for each year. This time the `axes` variable is a two-dimensional array, so take care with the indices.

fig, axes = plt.subplots(nrows=5, ncols=2)
for i, y in enumerate(['y1920', 'y1930', 'y1940', 'y1950', 'y1960']):
    df[y].plot(ax=axes[i, 0])
    axes[i, 0].set_title(y)
    if y != 'y1960':
        axes[i, 0].get_xaxis().set_visible(False)
for i, y in enumerate(['y1970', 'y1980', 'y1990', 'y2000', 'y2010']):
    df[y].plot(ax=axes[i, 1])
    axes[i, 1].set_title(y)
    if y != 'y2010':
        axes[i, 1].get_xaxis().set_visible(False)

population-10year-transition.png
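
As an alternative to indexing the two-dimensional axes array by hand, the array can be flattened in column order and paired with the year columns. This is a sketch on a made-up toy frame with four year columns, not the author's code; the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy frame standing in for the census data: made-up values, four year columns.
years = ["y1920", "y1930", "y1940", "y1950"]
toy = pd.DataFrame(
    np.arange(40).reshape(10, 4),
    columns=years,
    index=pd.Index(np.arange(10), name="age"),
)

# sharex=True hides the x tick labels on all but the bottom row.
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True)

# axes has shape (nrows, ncols). Transposing before flattening walks down the
# first column and then the second, so the years fill the grid column by column.
for ax, year in zip(axes.T.flat, years):
    toy[year].plot(ax=ax)
    ax.set_title(year)

fig.tight_layout()
```

With the real data the same loop would use nrows=5 and the ten year columns, which removes the need for two separate loops.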

The individual graphs show the impact of the baby booms. They also show that the number of births has declined since the second baby boom, and that since 1970 the elderly end of the distribution has broadened (life expectancy has increased).
