[PYTHON] How to read e-Stat subregion data

I am trying to understand the meaning of the header of the "Chiyoda Ward (10KB)" file under "Age (5-year age groups, 4 categories), population by sex" in the "2010 National Census (Small Area) 2010/10/01" data on e-Stat, the Ministry of Internal Affairs and Communications statistics portal. If I have made a mistake, I would appreciate a comment.

In principle, all of this should be written in the "definition document", but in many places I cannot work out what the definition document means. For example, I cannot see why "serial number" or "hierarchy" need explaining.

Looking at the data, the two header rows can be read as MultiIndex columns, as follows.

0 1
KEY_CODE
HYOSYO
CITYNAME
NAME
HTKSYORI
HTKSAKI
GASSAN
T000573001 Total, including age "unknown"
T000573002 Total, 0-4 years old
T000573003 Total, 5-9 years old
T000573004 Total, 10-14 years old
T000573005 Total, 15-19 years old
T000573006 Total, 20-24 years old
T000573007 Total, 25-29 years old
T000573008 Total, 30-34 years old
T000573009 Total, 35-39 years old
T000573010 Total, 40-44 years old
T000573011 Total, 45-49 years old
T000573012 Total, 50-54 years old
T000573013 Total, 55-59 years old
T000573014 Total, 60-64 years old
T000573015 Total, 65-69 years old
T000573016 Total, 70-74 years old
T000573017 Total, under 15 years old
T000573018 Total, 15-64 years old
T000573019 Total, 65 years old or older
T000573020 Total, 75 years old or older
T000573021 Male total, including age "unknown"
T000573022 Male, 0-4 years old
T000573023 Male, 5-9 years old
T000573024 Male, 10-14 years old
T000573025 Male, 15-19 years old
T000573026 Male, 20-24 years old
T000573027 Male, 25-29 years old
T000573028 Male, 30-34 years old
T000573029 Male, 35-39 years old
T000573030 Male, 40-44 years old
T000573031 Male, 45-49 years old
T000573032 Male, 50-54 years old
T000573033 Male, 55-59 years old
T000573034 Male, 60-64 years old
T000573035 Male, 65-69 years old
T000573036 Male, 70-74 years old
T000573037 Male, under 15 years old
T000573038 Male, 15-64 years old
T000573039 Male, 65 years old or older
T000573040 Male, 75 years old or older
T000573041 Female total, including age "unknown"
T000573042 Female, 0-4 years old
T000573043 Female, 5-9 years old
T000573044 Female, 10-14 years old
T000573045 Female, 15-19 years old
T000573046 Female, 20-24 years old
T000573047 Female, 25-29 years old
T000573048 Female, 30-34 years old
T000573049 Female, 35-39 years old
T000573050 Female, 40-44 years old
T000573051 Female, 45-49 years old
T000573052 Female, 50-54 years old
T000573053 Female, 55-59 years old
T000573054 Female, 60-64 years old
T000573055 Female, 65-69 years old
T000573056 Female, 70-74 years old
T000573057 Female, under 15 years old
T000573058 Female, 15-64 years old
T000573059 Female, 65 years old or older
T000573060 Female, 75 years old or older
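
To make the two-row header concrete, here is a minimal sketch that pairs the two rows using only the standard library. The sample text is a made-up miniature of the file (shortened column set, invented values), not real census data, and the names `sample`, `columns`, `flat` and `records` are only for illustration.

```python
import csv
import io

# Made-up miniature of the file: row 1 holds item codes, row 2 holds labels.
# The values and the shortened column set are illustrative, not real data.
sample = (
    "KEY_CODE,HYOSYO,GASSAN,T000573001,T000573002\n"
    ",,,Total incl. age unknown,Total 0-4 years old\n"
    "13101,1,,100,5\n"
)

rows = list(csv.reader(io.StringIO(sample)))
level0, level1, data = rows[0], rows[1], rows[2:]

# Pair the two header rows, as a two-level column index would.
columns = list(zip(level0, level1))

# Flatten: use the label when there is one, otherwise fall back to the code.
flat = [label if label else code for code, label in columns]

# Every value stays a string here, so codes keep their leading zeros.
records = [dict(zip(flat, row)) for row in data]
print(columns[3])
print(records[0]["KEY_CODE"], records[0]["Total 0-4 years old"])
```

The leading columns have an empty second-level label, which is why the flattening step falls back to the first-level name for them.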

KEY_CODE, HYOSYO, CITYNAME, NAME, HTKSYORI, HTKSAKI, and GASSAN are not described in the definition document mentioned above, yet their meanings are important.

KEY_CODE

Area code. The size of the area depends on the number of digits: a code has 5, 9, or 11 digits, and it is built up starting from the administrative code. Both a large area and the smaller areas it contains appear as rows. When they appear side by side, the row for the larger area holds the total.
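
As a rough sketch of this hierarchy, the helper below derives a parent code from a KEY_CODE. It assumes that a longer code always starts with the code of the area that contains it (11 digits inside 9, 9 inside 5); that nesting is my reading of the data, not something the definition document confirms. The function names and the sample codes are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Assumption: valid KEY_CODE lengths, from the largest area to the smallest.
LEVELS = (5, 9, 11)


def parent_code(key_code: str) -> Optional[str]:
    """Return the assumed parent KEY_CODE, or None for a 5-digit code."""
    n = len(key_code)
    if n not in LEVELS:
        raise ValueError(f"unexpected KEY_CODE length: {key_code!r}")
    i = LEVELS.index(n)
    return None if i == 0 else key_code[:LEVELS[i - 1]]


def group_children(key_codes: Iterable[str]) -> Dict[str, List[str]]:
    """Map each parent code to the codes directly below it."""
    children = defaultdict(list)
    for code in key_codes:
        parent = parent_code(code)
        if parent is not None:
            children[parent].append(code)
    return dict(children)


# Made-up codes for illustration only.
codes = ["13101", "131010010", "13101001001", "13101001002", "131010020"]
tree = group_children(codes)
print(tree)
```

Because the larger area's row already holds the total, summing every row of a file would count the same people more than once; grouping by parent like this is one way to pick a single level.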

HYOSYO

Possibly the depth (level) of KEY_CODE within its hierarchical code system?

CITYNAME

Place name of the largest unit

NAME

Place name of the smallest unit

HTKSYORI

Indicates whether "confidential processing" has been applied. There are two kinds, "secret area" and "total area", and the two refer to each other.

HTKSAKI

Holds a value when HTKSYORI is "secret area". Use it by concatenating it with the first 5 characters of KEY_CODE on the same row.

GASSAN

Holds a value when HTKSYORI is "there is a total area". Use it by concatenating it with the first 5 characters of KEY_CODE on the same row. When several areas are combined, their values are joined with semicolons.
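
The concatenation rule for HTKSAKI and GASSAN can be sketched as a small helper. `expand_refs` is a hypothetical name, and the sample values are made up; the real layout of these fields may differ from what is assumed here.

```python
from typing import List


def expand_refs(key_code: str, refs: str) -> List[str]:
    """Turn a GASSAN- or HTKSAKI-style value into full KEY_CODEs.

    refs holds zero or more code suffixes separated by ";".
    Each suffix is appended to the first 5 characters of key_code.
    """
    prefix = key_code[:5]
    return [prefix + s for s in refs.split(";") if s]


# Made-up values for illustration only.
print(expand_refs("131010010", "0020;0030"))
print(expand_refs("131010010", ""))
```

An empty field yields an empty list, so the helper can be applied to every row without checking HTKSYORI first.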

pandas

Like this?

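import glob

import pandas as pd

opts = dict(
    header=[0, 1],  # two header rows: item code, then label
    converters={i: str for i in range(7)},  # keep KEY_CODE ... GASSAN as strings
    encoding="CP932",
)
txt_ = pd.concat([pd.read_csv(f, **opts) for f in glob.glob("tblT000573C27*.txt")],
    ignore_index=True)
txt_columns = txt_.columns  # keep the original two-level columns for reference
# The first 7 columns have no label in the second header row, so keep their code name.
txt_.columns = [c[0] if c[1].startswith("Unnamed:") else c[1] for c in txt_.columns]
# Convert the count columns to int, treating "-" and "X" as 0.
to_int = lambda s: 0 if s in ("-", "X") else int(s)
txt = pd.concat([txt_[txt_.columns[:7]],
    txt_[txt_.columns[7:]].apply(lambda col: col.map(to_int))], axis=1)

# Each entry in GASSAN, prefixed with the first 5 characters of KEY_CODE,
# should match exactly one row.
for ri, r in txt.iterrows():
    for s in r["GASSAN"].split(";"):
        if not s:
            continue
        t = r["KEY_CODE"][:5] + s
        assert txt[txt["KEY_CODE"] == t].shape[0] == 1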
