This is an attempt to understand the header of the "Chiyoda Ward (10KB)" file under "Population by age (5-year age groups, 4 categories) and sex" in the "2010 National Census (Small Area) 2010/10/01" data on the Ministry of Internal Affairs and Communications' e-Stat. If I have anything wrong, I would appreciate a comment.
In principle this should all be covered by the "definition document", but there are many places where I could not work out what the definition document means. I also could not find the explanations I needed for items such as "serial number" and "hierarchy".
Looking at the data, the header can be read as MultiIndex columns, as follows.
0 | 1 |
---|---|
KEY_CODE | |
HYOSYO | |
CITYNAME | |
NAME | |
HTKSYORI | |
HTKSAKI | |
GASSAN | |
T000573001 | Total, including age "unknown" |
T000573002 | Total, 0-4 years old |
T000573003 | Total, 5-9 years old |
T000573004 | Total, 10-14 years old |
T000573005 | Total, 15-19 years old |
T000573006 | Total, 20-24 years old |
T000573007 | Total, 25-29 years old |
T000573008 | Total, 30-34 years old |
T000573009 | Total, 35-39 years old |
T000573010 | Total, 40-44 years old |
T000573011 | Total, 45-49 years old |
T000573012 | Total, 50-54 years old |
T000573013 | Total, 55-59 years old |
T000573014 | Total, 60-64 years old |
T000573015 | Total, 65-69 years old |
T000573016 | Total, 70-74 years old |
T000573017 | Total, under 15 years old |
T000573018 | Total, 15-64 years old |
T000573019 | Total, 65 years old or older |
T000573020 | Total, 75 years old or older |
T000573021 | Male total, including age "unknown" |
T000573022 | Male, 0-4 years old |
T000573023 | Male, 5-9 years old |
T000573024 | Male, 10-14 years old |
T000573025 | Male, 15-19 years old |
T000573026 | Male, 20-24 years old |
T000573027 | Male, 25-29 years old |
T000573028 | Male, 30-34 years old |
T000573029 | Male, 35-39 years old |
T000573030 | Male, 40-44 years old |
T000573031 | Male, 45-49 years old |
T000573032 | Male, 50-54 years old |
T000573033 | Male, 55-59 years old |
T000573034 | Male, 60-64 years old |
T000573035 | Male, 65-69 years old |
T000573036 | Male, 70-74 years old |
T000573037 | Male, under 15 years old |
T000573038 | Male, 15-64 years old |
T000573039 | Male, 65 years old or older |
T000573040 | Male, 75 years old or older |
T000573041 | Female total, including age "unknown" |
T000573042 | Female, 0-4 years old |
T000573043 | Female, 5-9 years old |
T000573044 | Female, 10-14 years old |
T000573045 | Female, 15-19 years old |
T000573046 | Female, 20-24 years old |
T000573047 | Female, 25-29 years old |
T000573048 | Female, 30-34 years old |
T000573049 | Female, 35-39 years old |
T000573050 | Female, 40-44 years old |
T000573051 | Female, 45-49 years old |
T000573052 | Female, 50-54 years old |
T000573053 | Female, 55-59 years old |
T000573054 | Female, 60-64 years old |
T000573055 | Female, 65-69 years old |
T000573056 | Female, 70-74 years old |
T000573057 | Female, under 15 years old |
T000573058 | Female, 15-64 years old |
T000573059 | Female, 65 years old or older |
T000573060 | Female, 75 years old or older |
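To illustrate how the two header rows pair up, here is a minimal sketch using only the standard library. The sample text is made up (it is not real e-Stat data), the English labels stand in for the labels in the actual file, and `flatten_header` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample with the same two-row header shape (not real e-Stat data).
SAMPLE = (
    "KEY_CODE,HYOSYO,T000573001,T000573002\n"
    ",,Total,Total 0-4 years old\n"
    "00000,1,10,2\n"
)

def flatten_header(first_row, second_row):
    """Use the second-row label when present, otherwise the first-row code."""
    return [second or first for first, second in zip(first_row, second_row)]

reader = csv.reader(io.StringIO(SAMPLE))
codes = next(reader)   # first header row: column codes
labels = next(reader)  # second header row: labels (empty for the key columns)
columns = flatten_header(codes, labels)
rows = [dict(zip(columns, row)) for row in reader]
print(columns)
```

The key columns have no label in the second row, so they keep their code as the column name, while the `T000573...` columns take the descriptive label.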
KEY_CODE, HYOSYO, CITYNAME, NAME, HTKSYORI, HTKSAKI, and GASSAN are not covered by the definitions listed above, yet their meanings are important.
KEY_CODE
Area code. The size of the area depends on the number of digits. Codes have 5, 9, or 11 digits and are built up starting from the administrative code. A large area and the smaller areas it contains both exist as rows. When both are present, the row for the larger area holds the total.
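The relationship between large and small areas can be sketched as below. This assumes that a smaller area's code starts with the code of the larger area that contains it (a prefix relationship), which the description suggests but does not state outright. The codes and counts are invented, and `children_of` is a hypothetical helper.

```python
# Invented rows: KEY_CODE -> total population (not real data).
rows = {
    "99999": 60,          # 5 digits: the largest area
    "999990010": 60,      # 9 digits
    "99999001001": 25,    # 11 digits
    "99999001002": 35,
}

LEVELS = (5, 9, 11)  # code lengths listed in the text

def children_of(code, all_codes):
    """Codes one level below `code`, assuming a prefix relationship."""
    idx = LEVELS.index(len(code))
    if idx + 1 == len(LEVELS):
        return []  # 11-digit codes are the smallest areas
    child_len = LEVELS[idx + 1]
    return [c for c in all_codes if len(c) == child_len and c.startswith(code)]

for code, total in rows.items():
    kids = children_of(code, rows)
    if kids:
        # The larger area should hold the sum of the smaller areas it contains.
        assert total == sum(rows[k] for k in kids), code
```

When aggregating, this is why rows of different code lengths should not be summed together: the larger areas already include the smaller ones.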
HYOSYO
The depth of KEY_CODE within its hierarchical code system?
CITYNAME
Place name at the largest unit
NAME
Place name at the smallest unit
HTKSYORI
Whether "confidential processing" has been applied. There are two kinds, "secret area" and "total area", and rows of the two kinds are related to each other through HTKSAKI and GASSAN.
HTKSAKI
A value is present when HTKSYORI is "secret area". Concatenate it with the first 5 characters of KEY_CODE on the same row before use.
GASSAN
A value is present when HTKSYORI is "there is a total area". Concatenate it with the first 5 characters of KEY_CODE on the same row before use. When several areas are combined, they are joined with semicolons.
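As a rough illustration of the concatenation rule for HTKSAKI and GASSAN, here is a small standard-library sketch. The rows are invented, `resolve_refs` is a hypothetical helper, and the check that each area listed in GASSAN points back through HTKSAKI is only my reading of the "mutual relationship", not something the definition document confirms.

```python
def resolve_refs(key_code, field):
    """Turn an HTKSAKI or GASSAN value into full KEY_CODEs."""
    prefix = key_code[:5]  # first 5 characters of KEY_CODE on the same row
    return [prefix + part for part in field.split(";") if part]

# Invented rows (not real data): KEY_CODE -> (HTKSAKI, GASSAN)
rows = {
    "999990010": ("", "0020;0030"),  # total area covering two other areas
    "999990020": ("0010", ""),       # secret area pointing at the total area
    "999990030": ("0010", ""),
}

for code, (htksaki, gassan) in rows.items():
    for target in resolve_refs(code, gassan):
        # Assumed symmetry: each area listed in GASSAN points back via HTKSAKI.
        assert resolve_refs(target, rows[target][0]) == [code]
```

An empty field resolves to an empty list, so rows without confidential processing need no special handling.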
pandas
Reading the files might look something like this:
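import glob
import pandas as pd

# Two header rows; read the first 7 (key) columns as strings.
opts = dict(
    header=[0, 1],
    converters={i: str for i in range(7)},
)
# Read every matching file (CP932-encoded) and stack them.
txt_ = pd.concat(
    [pd.read_csv(f, encoding="CP932", **opts) for f in glob.glob("tblT000573C27*.txt")]
)
txt_columns = txt_.columns  # keep the original MultiIndex for reference
# Flatten the header: use the label when there is one, otherwise the code.
txt_.columns = [c[0] if c[1].startswith("Unnamed:") else c[1] for c in txt_.columns]

def to_int(s):
    # "-" and "X" are treated as 0 here.
    return 0 if s in ("-", "X") else int(s)

txt = pd.concat(
    [txt_[txt_.columns[:7]], txt_[txt_.columns[7:]].apply(lambda col: col.map(to_int))],
    axis=1,
)
# Every area listed in GASSAN should exist as exactly one row.
for ri, r in txt.iterrows():
    for s in r["GASSAN"].split(";"):
        if not s:
            continue
        t = r["KEY_CODE"][:5] + s
        assert txt[txt["KEY_CODE"] == t].shape[0] == 1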
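The conversion above replaces the markers "-" and "X" with 0, so they can no longer be told apart from a real count of zero. If that distinction matters, one alternative is to keep them as missing values. This is only a minimal sketch: `parse_count` is a hypothetical helper, and treating both markers as missing is an assumption rather than something the definition document prescribes.

```python
def parse_count(value):
    """Return an int, or None for the markers "-" and "X"."""
    text = str(value).strip()
    if text in ("-", "X"):
        return None  # keep the marker distinguishable from a real zero
    return int(text)

counts = [parse_count(v) for v in ["12", "-", "X", 0]]
# Sum only the cells that carry an actual number.
known_total = sum(c for c in counts if c is not None)
```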