[PYTHON] Let's visualize the relationship between average salary and industry with XBRL data and seaborn! (7/10)

In the 7th post of Advent Calendar, we will programmatically visualize the dataset published in the 6th post (data cleansing of the information extracted from the data described in XBRL).

(The program in this article is provided as it is without any guarantee, and XBRL Japan takes all responsibility for any disadvantages or problems caused by using this program, regardless of the cause. I will not bear it.)

1. Visualize XBRL data

1.1 Program Overview

This program is published in "Let's automatically extract employee information such as average salary from XBRL data disclosed by EDINET (data set sharing) (6/10)" (Average annual salary ( average annual salary (A Python language program that visualizes datasets of), average years of service (years), and average age (years)) with the matplotlib` library. (All codes are listed in "2. Source code") We chose box plots, bar charts, and kernel density estimation as the visualization method. Companies that do not have all the data such as salary information are excluded from this processing as missing data.

1.2 Preparation

Please take the following actions before executing the program. Installation of other libraries (japanize_matplotlib, seaborn, etc.) must also be performed in advance. Also, change the encoding and path specification method as appropriate according to the execution environment. Then, change the sep attribute of read_csv to sep =',', or sep ='/ t' as appropriate depending on the CSV creation method.

1.2.1 Determining the dataset file path

Determines the path to read the data cleansed dataset (CSV format) file. dataset_filepath ='C:\\Users\\xxx\\Desktop\\xbrlReport\\xbrl_cleansing_qiita.csv'

df_dataset = pd.read_csv(dataset_filepath, header=0,sep=',', engine="python")

1.2.2 Determining the output file path

Determine the file path to output the image file. plt.savefig('C:\\Users\\XXX\\Desktop\\xbrlReport\\XXXX.png')

1.3 Execution result

When executed, the result of visualizing three types of KDEs is a box plot (average annual salary by industry, average years of service by industry, average age by industry), bar graph (average annual salary TOP50 (information and communication industry)). It is output as an image.

1.3.1 Box plot

Regarding the boxplot of average annual salary by industry, first, the given data set was grouped by industry, and the average annual salary was calculated for each industry. After that, we sorted the average annual salary by industry and created a sorted industry name list of average annual salary from the results obtained.

Code1


df_groupby_mean =dropped_dataset.groupby(['Industry'], as_index=False).mean()
df_groupby_mean_s_by_salary = df_groupby_mean.sort_values('Average annual salary (yen)')
df_gyoshu_label_by_salary = df_groupby_mean_s_by_salary['Industry']
gyoshu_label_list_by_salary=df_gyoshu_label_by_salary.tolist()

By applying the list in the order of display on the Y-axis during visualization, it is possible to visualize the average annual salary by industry. A box plot using boxenplot allows you to visually check the difference between the mean and median values. In addition, by plotting the points with stripplot, we also visualized the scatter. The average years of service and average age are also created by performing the same processing.

Code2


sns.stripplot(x='Average annual salary (yen)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_salary)
ax =sns.boxenplot(x='Average annual salary (yen)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_salary);
Box_beard1.png Box_beard2.png Box_beard3.png

1.3.2 Bar chart

Only companies in the information and telecommunications industry were extracted, and then the top 50 companies were extracted from the results of sorting by average annual salary (yen). (I chose the information / communication industry in consideration of the Qiita viewers)

Code3


df_info=df_dropped_dataset[df_dropped_dataset["Industry"] == "Information and communication industry"]
df_info_sortby_salary = df_info.sort_values('Average annual salary (yen)')[-50:]

As in the case of the boxplot, the company names sorted by salary are listed and used in the order of display on the X-axis. After that, you can draw the top 50 companies with average annual salary just by passing the information / communication industry data and the sorted list to barplot.

Code4



df_info_label_list_sortby_salary=df_info_sortby_salary['Company name'].tolist()
ax =sns.barplot(x="Company name", y="Average annual salary (yen)",  data=df_info,palette='rocket',order=df_info_label_list_sortby_salary)

By changing [" industry "] ==" information / communication industry " to your favorite industry (fisheries / agriculture / forestry, service industry, etc.), you can visualize the TOP 50 of other industries. You can also control how many TOPs are drawn by changing the value of [-50:]. And, although some customization is required, you can also get filtered results by adding average years of service (years) andaverage age (years)other than the average annual salary. ..

output5.png

1.3.3 Kernel density estimation

I drew a KDE that combines three patterns: average years of service-average annual salary, average age-average annual salary, and average age-average years of service. Unlike by industry, you can get a general overview of the entire company submitted to EDINET from a bird's-eye view.

Code5



sns.jointplot('Average years of service (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#d9f2f8")
sns.jointplot('Average age (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#fac8be")
sns.jointplot('Average age (years)', 'Average years of service (years)',data=df_dropped_dataset, kind="kde",color="#d6fada")

KDE1.png KDE2.png KDE3.png

2. Source code

# -*- coding: utf-8 -*-

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import japanize_matplotlib # https://yolo.love/matplotlib/japanese/

def drop_unnecessary_data(dataset_filepath):
    df_dataset = pd.read_csv(dataset_filepath, header=0,sep=',', engine="python")

    df_dropped_dataset = df_dataset.dropna()
    df_dropped_dataset = df_dropped_dataset.drop('#', axis=1)
    df_dropped_dataset = df_dropped_dataset.drop('EDINETCODE', axis=1)

    print(df_dropped_dataset)
    print(df_dropped_dataset.info())
    print('Number of lines',len(df_dropped_dataset))

    return df_dropped_dataset

def make_label_list(dropped_dataset):

    df_groupby_mean =dropped_dataset.groupby(['Industry'], as_index=False).mean()

    df_groupby_mean_s_by_salary = df_groupby_mean.sort_values('Average annual salary (yen)')
    df_gyoshu_label_by_salary = df_groupby_mean_s_by_salary['Industry']
    gyoshu_label_list_by_salary=df_gyoshu_label_by_salary.tolist()

    df_groupby_mean_s_by_service = df_groupby_mean.sort_values('Average years of service (years)')
    df_gyoshu_label_by_service = df_groupby_mean_s_by_service['Industry']
    gyoshu_label_list_by_service=df_gyoshu_label_by_service.tolist()

    df_groupby_mean_s_by_age = df_groupby_mean.sort_values('Average age (years)')
    df_gyoshu_label_by_age = df_groupby_mean_s_by_age['Industry']
    gyoshu_label_list_by_age=df_gyoshu_label_by_age.tolist()

    return gyoshu_label_list_by_salary,gyoshu_label_list_by_service,gyoshu_label_list_by_age

def visualize_boxenplot_salary(gyoshu_label_list_by_salary,df_dropped_dataset):
    plt.figure(figsize=(15, 10))

    sns.stripplot(x='Average annual salary (yen)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_salary)
    ax =sns.boxenplot(x='Average annual salary (yen)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_salary)

    ax.grid(which = "major",color='lightgray',ls=':',alpha = 0.5)
    ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())

    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

    plt.xlabel("Average annual salary (yen)", fontsize=18)
    plt.ylabel("Industry",fontsize=16)

    plt.title("Average annual salary by industry", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig('C:\\Users\\xxx\\Desktop\\xbrlReport\\boxenplot_1.png')

    plt.show()

def visualize_boxenplot_service(gyoshu_label_list_by_service,df_dropped_dataset):
    plt.figure(figsize=(15, 10))

    sns.stripplot(x='Average years of service (years)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_service)
    ax =sns.boxenplot(x='Average years of service (years)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_service)

    ax.grid(which = "major",color='lightgray',ls=':',alpha = 0.5)
    ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())

    plt.xlabel("Average years of service (years)", fontsize=18)
    plt.ylabel("Industry",fontsize=16)

    plt.title("Average years of service by industry", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\boxenplot_2.png ")

    plt.show()

def visualize_boxenplot_age(gyoshu_label_list_by_age,df_dropped_dataset):
    plt.figure(figsize=(15, 10))

    sns.stripplot(x='Average age (years)', y ='Industry',orient = 'h', data=df_dropped_dataset,size=3,edgecolor="gray",order=gyoshu_label_list_by_age)
    ax =sns.boxenplot(x='Average age (years)', y='Industry',orient = 'h', data=df_dropped_dataset,palette='rainbow',order=gyoshu_label_list_by_age);

    ax.grid(which = "major",color='lightgray',ls=':',alpha = 0.5)
    ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())

    plt.xlabel("Average age (years)", fontsize=18)
    plt.ylabel("Industry",fontsize=16)

    plt.title("Average age by industry", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\boxenplot_3.png ")

    plt.show()

def visualize_jointplot(df_dropped_dataset):
    sns.jointplot('Average years of service (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#d9f2f8")
    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\jointplot_1.png ")
    plt.show()

    sns.jointplot('Average age (years)', 'Average annual salary (yen)',data=df_dropped_dataset, kind="kde",color="#fac8be")
    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\jointplot_2.png ")
    plt.show()

    sns.jointplot('Average age (years)', 'Average years of service (years)',data=df_dropped_dataset, kind="kde",color="#d6fada")
    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\jointplot_3.png ")
    plt.show()

def visualize_barplot(df_dropped_dataset):
    df_info=df_dropped_dataset[df_dropped_dataset["Industry"] == "Information and communication industry"]
    df_info_sortby_salary = df_info.sort_values('Average annual salary (yen)')[-50:]
    df_info_label_list_sortby_salary=df_info_sortby_salary['Company name'].tolist()

    plt.figure(figsize=(15, 12))
    ax =sns.barplot(x="Company name", y="Average annual salary (yen)",  data=df_info,palette='rocket',order=df_info_label_list_sortby_salary)
    sns.set(style="ticks")
    plt.xticks(rotation=90)

    plt.subplots_adjust(hspace=0.8,bottom=0.35)

    ax.grid(which = 'major',axis ='y', color='lightgray',ls=':',alpha = 0.5)

    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

    plt.xlabel("Company name", fontsize=12)
    plt.ylabel("Average annual salary (yen)",fontsize=18)

    plt.title("Information and communication industry:Average annual salary TOP50", fontsize=24)

    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().yaxis.set_ticks_position('left')
    plt.gca().xaxis.set_ticks_position('bottom')

    plt.savefig("C:\\Users\\xxx\\Desktop\\xbrlReport\\barplot_1.png ")

    plt.show()

def main():
    dataset_filepath ='C:\\Users\\xxx\\Desktop\\xbrlReport\\xbrl_cleansing_qiita.csv'
    df_dropped_dataset = drop_unnecessary_data(dataset_filepath)

    gyoshu_label_list_by_salary, gyoshu_label_list_by_service, gyoshu_label_list_by_age = make_label_list(df_dropped_dataset)

    visualize_boxenplot_salary(gyoshu_label_list_by_salary,df_dropped_dataset)
    visualize_boxenplot_service(gyoshu_label_list_by_service,df_dropped_dataset)
    visualize_boxenplot_age(gyoshu_label_list_by_age,df_dropped_dataset)

    visualize_jointplot(df_dropped_dataset)

    visualize_barplot(df_dropped_dataset)

    print("visualize finish")

if __name__ == "__main__":
    main()


3. For analysis

This article focused on the visualization of raw data, but with a few modifications to this data, corporate analysis is possible. For example, if the average age-average years of service is around 25 years old, it is known that there are few retirees and mid-career hires, so it is possible to extract such companies. By adding company age here, a deeper analysis can be performed. (Company age can be obtained from XBRL data with the element name "jpcrp_cor: FiscalYearCoverPage") Companies with high corporate age and average years of service over 42 years old are aging. In the future, as older people retire in bulk, the average age will suddenly drop, and as a result, net income will suddenly rise. (Not because the profit margin of the company has increased, but because the total amount of salary paid has decreased, which is the heaviest in SG & A expenses.) In the future, the profit margin of all companies will increase due to the decrease in the age distribution of employees. It is interesting to pay attention to the distribution of employees, not just the changes in profit margins. In addition, it is also possible to compare and analyze the time series of sales ÷ number of employees (average of the previous term and this term) and number of employees. The reason for this is that per capita sales can be seen better due to the restructuring effect. In this way, the ability to obtain employee information from EDINET has expanded the range of corporate analysis. If you are interested in anything other than what is described here, please do a survey.

4. Contact information

For inquiries regarding this article, please contact the following e-mail address. e-mail:[email protected] (Of course, comments on qiita are also welcome)

This e-mail address will be the contact point for inquiries about the Development Committee of XBRL Japan, which writes the article for qiita. I will. Therefore, we cannot answer general inquiries about the organization depending on the content, but please feel free to contact us with any technical questions, opinions, requests, advice, etc. regarding XBRL. Please note that it may take some time to respond because the committee members are volunteers.

Recommended Posts

Let's visualize the relationship between average salary and industry with XBRL data and seaborn! (7/10)
[Statistics] Let's visualize the relationship between the normal distribution and the chi-square distribution.
Let's automatically extract employee information such as average salary from the XBRL data disclosed on EDINET (5/10)
The subtle relationship between Gentoo and pip
About the relationship between Git and GitHub
Relationship between Firestore and Go data type conversion
Interactively visualize data with TreasureData, Pandas and Jupyter.
Let's transpose the matrix with numpy and multiply the matrices.
Investigating the relationship between ice cream spending and temperature
Investigate the relationship between TensorFlow and Keras in transition
Relationship data learning with numpy and NetworkX (spectral clustering)
Visualize the range of interpolation and extrapolation with python
Visualize data and understand correlation at the same time
Overview and tips of seaborn with statistical data visualization
I examined the data mapping between ArangoDB and Java
Let's distinguish between data structure manipulation and logic code.
Get comments and subscribers with the YouTube Data API
Let's visualize the rainfall data released by Shimane Prefecture
Visualize data with Streamlit
Solve the spiral book (algorithm and data structure) with python!
Let's visualize the number of people infected with coronavirus with matplotlib
Let's move word2vec with Chainer and see the learning progress