In the 7th post of Advent Calendar, we will programmatically visualize the dataset published in the 6th post (data cleansing of the information extracted from the data described in XBRL).

(The program in this article is provided as it is without any guarantee, and XBRL Japan takes all responsibility for any disadvantages or problems caused by using this program, regardless of the cause. I will not bear it.)

1. Visualize XBRL data

1.1 Program Overview

This program is published in "Let's automatically extract employee information such as average salary from XBRL data disclosed by EDINET (data set sharing) (6/10)" (Average annual salary ( average annual salary (A Python language program that visualizes datasets of), average years of service (years), and average age (years)) with the matplotlib` library. (All codes are listed in "2. Source code") We chose box plots, bar charts, and kernel density estimation as the visualization method. Companies that do not have all the data such as salary information are excluded from this processing as missing data.

1.2 Preparation

Please take the following actions before executing the program. Installation of other libraries (japanize_matplotlib, seaborn, etc.) must also be performed in advance. Also, change the encoding and path specification method as appropriate according to the execution environment. Then, change the sep attribute of read_csv to sep =',', or sep ='/ t' as appropriate depending on the CSV creation method.

1.2.1 Determining the dataset file path

Determines the path to read the data cleansed dataset (CSV format) file. dataset_filepath ='C:\\Users\\xxx\\Desktop\\xbrlReport\\xbrl_cleansing_qiita.csv'

df_dataset = pd.read_csv(dataset_filepath, header=0,sep=',', engine="python")

Depending on the environment, it may be necessary to set sep ='\ t' instead of sep ='.

1.2.2 Determining the output file path

Determine the file path to output the image file. plt.savefig('C:\\Users\\XXX\\Desktop\\xbrlReport\\XXXX.png')

1.3 Execution result

When executed, the result of visualizing three types of KDEs is a box plot (average annual salary by industry, average years of service by industry, average age by industry), bar graph (average annual salary TOP50 (information and communication industry)). It is output as an image.

1.3.1 Box plot

Regarding the boxplot of average annual salary by industry, first, the given data set was grouped by industry, and the average annual salary was calculated for each industry. After that, we sorted the average annual salary by industry and created a sorted industry name list of average annual salary from the results obtained.

`Code1`


df_groupby_mean =dropped_dataset.groupby(['Industry'], as_index=False).mean()
df_groupby_mean_s_by_salary = df_groupby_mean.sort_values('Average annual salary (yen)')
df_gyoshu_label_by_salary = df_groupby_mean_s_by_salary['Industry']
gyoshu_label_list_by_salary=df_gyoshu_label_by_salary.tolist()

By applying the list in the order of display on the Y-axis during visualization, it is possible to visualize the average annual salary by industry. A box plot using boxenplot allows you to visually check the difference between the mean and median values. In addition, by plotting the points with stripplot, we also visualized the scatter. The average years of service and average age are also created by performing the same processing.

`Code2`


sns.stripplot(x='Average annual salary (yen)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_salary)
ax =sns.boxenplot(x='Average annual salary (yen)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_salary);

1.3.2 Bar chart

Only companies in the information and telecommunications industry were extracted, and then the top 50 companies were extracted from the results of sorting by average annual salary (yen). (I chose the information / communication industry in consideration of the Qiita viewers)

`Code3`


df_info=df_dropped_dataset[df_dropped_dataset["Industry"] == "Information and communication industry"]
df_info_sortby_salary = df_info.sort_values('Average annual salary (yen)')[-50:]

As in the case of the boxplot, the company names sorted by salary are listed and used in the order of display on the X-axis. After that, you can draw the top 50 companies with average annual salary just by passing the information / communication industry data and the sorted list to barplot.

`Code4`



df_info_label_list_sortby_salary=df_info_sortby_salary['Company name'].tolist()
ax =sns.barplot(x="Company name", y="Average annual salary (yen)",  data=df_info,palette='rocket',order=df_info_label_list_sortby_salary)

By changing [" industry "] ==" information / communication industry " to your favorite industry (fisheries / agriculture / forestry, service industry, etc.), you can visualize the TOP 50 of other industries. You can also control how many TOPs are drawn by changing the value of [-50:]. And, although some customization is required, you can also get filtered results by adding average years of service (years) andaverage age (years)other than the average annual salary. ..

1.3.3 Kernel density estimation

I drew a KDE that combines three patterns: average years of service-average annual salary, average age-average annual salary, and average age-average years of service. Unlike by industry, you can get a general overview of the entire company submitted to EDINET from a bird's-eye view.

`Code5`



sns.jointplot('Average years of service (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#d9f2f8")
sns.jointplot('Average age (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#fac8be")
sns.jointplot('Average age (years)', 'Average years of service (years)',data=df_dropped_dataset, kind="kde",color="#d6fada")

2. Source code

# -*- coding: utf-8 -*-

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import japanize_matplotlib # https://yolo.love/matplotlib/japanese/

def drop_unnecessary_data(dataset_filepath):
    df_dataset = pd.read_csv(dataset_filepath, header=0,sep=',', engine="python")

    df_dropped_dataset = df_dataset.dropna()
    df_dropped_dataset = df_dropped_dataset.drop('#', axis=1)
    df_dropped_dataset = df_dropped_dataset.drop('EDINETCODE', axis=1)

    print(df_dropped_dataset)
    print(df_dropped_dataset.info())
    print('Number of lines',len(df_dropped_dataset))

    return df_dropped_dataset

def make_label_list(dropped_dataset):

    df_groupby_mean =dropped_dataset.groupby(['Industry'], as_index=False).mean()

    df_groupby_mean_s_by_salary = df_groupby_mean.sort_values('Average annual salary (yen)')
    df_gyoshu_label_by_salary = df_groupby_mean_s_by_salary['Industry']
    gyoshu_label_list_by_salary=df_gyoshu_label_by_salary.tolist()

    df_groupby_mean_s_by_service = df_groupby_mean.sort_values('Average years of service (years)')
    df_gyoshu_label_by_service = df_groupby_mean_s_by_service['Industry']
    gyoshu_label_list_by_service=df_gyoshu_label_by_service.tolist()

    df_groupby_mean_s_by_age = df_groupby_mean.sort_values('Average age (years)')
    df_gyoshu_label_by_age = df_groupby_mean_s_by_age['Industry']
    gyoshu_label_list_by_age=df_gyoshu_label_by_age.tolist()

    return gyoshu_label_list_by_salary,gyoshu_label_list_by_service,gyoshu_label_list_by_age

def visualize_boxenplot_salary(gyoshu_label_list_by_salary,df_dropped_dataset):
    plt.figure(figsize=(15, 10))

    sns.stripplot(x='Average annual salary (yen)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_salary)
    ax =sns.boxenplot(x='Average annual salary (yen)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_salary)

    ax.grid(which = "major",color='lightgray',ls=':',alpha = 0.5)
    ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())

    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

    plt.xlabel("Average annual salary (yen)", fontsize=18)
    plt.ylabel("Industry",fontsize=16)

    plt.title("Average annual salary by industry", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig('C:\\Users\\xxx\\Desktop\\xbrlReport\\boxenplot_1.png')

    plt.show()

def visualize_boxenplot_service(gyoshu_label_list_by_service,df_dropped_dataset):
    plt.figure(figsize=(15, 10))

    sns.stripplot(x='Average years of service (years)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_service)
    ax =sns.boxenplot(x='Average years of service (years)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_service)

    ax.grid(which = "major",color='lightgray',ls=':',alpha = 0.5)
    ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())

    plt.xlabel("Average years of service (years)", fontsize=18)
    plt.ylabel("Industry",fontsize=16)

    plt.title("Average years of service by industry", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\boxenplot_2.png ")

    plt.show()

def visualize_boxenplot_age(gyoshu_label_list_by_age,df_dropped_dataset):
    plt.figure(figsize=(15, 10))

    sns.stripplot(x='Average age (years)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_age)
    ax =sns.boxenplot(x='Average age (years)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_age);

    ax.grid(which = "major",color='lightgray',ls=':',alpha = 0.5)
    ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())

    plt.xlabel("Average age (years)", fontsize=18)
    plt.ylabel("Industry",fontsize=16)

    plt.title("Average age by industry", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\boxenplot_3.png ")

    plt.show()

def visualize_jointplot(df_dropped_dataset):
    sns.jointplot('Average years of service (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#d9f2f8")
    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\jointplot_1.png ")
    plt.show()

    sns.jointplot('Average age (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#fac8be")
    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\jointplot_2.png ")
    plt.show()

    sns.jointplot('Average age (years)', 'Average years of service (years)',data=df_dropped_dataset, kind="kde",color="#d6fada")
    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\jointplot_3.png ")
    plt.show()

def visualize_barplot(df_dropped_dataset):
    df_info=df_dropped_dataset[df_dropped_dataset["Industry"] == "Information and communication industry"]
    df_info_sortby_salary = df_info.sort_values('Average annual salary (yen)')[-50:]
    df_info_label_list_sortby_salary=df_info_sortby_salary['Company name'].tolist()

    plt.figure(figsize=(15, 12))
    ax =sns.barplot(x="Company name", y="Average annual salary (yen)",  data=df_info,palette='rocket',order=df_info_label_list_sortby_salary)
    sns.set(style="ticks")
    plt.xticks(rotation=90)

    plt.subplots_adjust(hspace=0.8,bottom=0.35)

    ax.grid(which = 'major',axis ='y', color='lightgray',ls=':',alpha = 0.5)

    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

    plt.xlabel("Company name", fontsize=12)
    plt.ylabel("Average annual salary (yen)",fontsize=18)

    plt.title("Information and communication industry:Average annual salary TOP50", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\barplot_1.png ")

    plt.show()

def main():
    dataset_filepath ='C:\\Users\\xxx\\Desktop\\xbrlReport\\xbrl_cleansing_qiita.csv'
    df_dropped_dataset = drop_unnecessary_data(dataset_filepath)

    gyoshu_label_list_by_salary, gyoshu_label_list_by_service, gyoshu_label_list_by_age = make_label_list(df_dropped_dataset)

    visualize_boxenplot_salary(gyoshu_label_list_by_salary,df_dropped_dataset)
    visualize_boxenplot_service(gyoshu_label_list_by_service,df_dropped_dataset)
    visualize_boxenplot_age(gyoshu_label_list_by_age,df_dropped_dataset)

    visualize_jointplot(df_dropped_dataset)

    visualize_barplot(df_dropped_dataset)

    print("visualize finish")

if __name__ == "__main__":
    main()

3. For analysis

This article focused on the visualization of raw data, but with a few modifications to this data, corporate analysis is possible. For example, if the average age-average years of service is around 25 years old, it is known that there are few retirees and mid-career hires, so it is possible to extract such companies. By adding company age here, a deeper analysis can be performed. (Company age can be obtained from XBRL data with the element name "jpcrp_cor: FiscalYearCoverPage") Companies with high corporate age and average years of service over 42 years old are aging. In the future, as older people retire in bulk, the average age will suddenly drop, and as a result, net income will suddenly rise. (Not because the profit margin of the company has increased, but because the total amount of salary paid has decreased, which is the heaviest in SG & A expenses.) In the future, the profit margin of all companies will increase due to the decrease in the age distribution of employees. It is interesting to pay attention to the distribution of employees, not just the changes in profit margins. In addition, it is also possible to compare and analyze the time series of sales ÷ number of employees (average of the previous term and this term) and number of employees. The reason for this is that per capita sales can be seen better due to the restructuring effect. In this way, the ability to obtain employee information from EDINET has expanded the range of corporate analysis. If you are interested in anything other than what is described here, please do a survey.

4. Contact information

For inquiries regarding this article, please contact the following e-mail address. e-mail：[email protected] (Of course, comments on qiita are also welcome)

This e-mail address will be the contact point for inquiries about the Development Committee of XBRL Japan, which writes the article for qiita. I will. Therefore, we cannot answer general inquiries about the organization depending on the content, but please feel free to contact us with any technical questions, opinions, requests, advice, etc. regarding XBRL. Please note that it may take some time to respond because the committee members are volunteers.

[PYTHON] Let's visualize the relationship between average salary and industry with XBRL data and seaborn! (7/10)