What I saw by analyzing the data of the engineer market

Introduction

This is the Day 25 article of the Python Advent Calendar 2019.

On Day 25 of last year's Python Advent Calendar 2018, I wrote an article titled "Statistics Learned in Python & Data Analysis Environment Created at Home". In that article, I scraped the highest presented annual incomes from the 15th draft participant ranking of Job Change Draft and performed a simple data analysis.

This article summarizes the results of a data analysis of the engineer market using data from the 22nd Draft Participating User Rankings.

Since it's that time of year again, let's take a look at the engineer market heading into 2020.

Scraping

The code I wrote last year only collected the highest amounts, so I modified it to scrape all the data from the user ranking pages.

First, run the following program to collect the data needed for the analysis (*). The environment in this article runs on a Raspberry Pi.

(*) When scraping, check the terms of use of the target service and be considerate of the load you place on it.
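One lightweight way to respect a site's wishes before fetching is to check its robots.txt. A minimal sketch using only the standard library (the sample rules below are made up for illustration):

```python
from urllib import robotparser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Parse a robots.txt body and report whether the URL may be fetched."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt body
sample = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample, "https://example.com/public/page"))   # True
print(is_allowed(sample, "https://example.com/private/page"))  # False
```

In practice you would fetch the real file with `RobotFileParser.set_url(...)` and `read()`; the string-parsing form above just keeps the example self-contained.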

#! /usr/bin/env python3
# -*- coding: utf-8 -*-

#Import modules required for scraping
import csv
import sys
sys.path.append('/home/pi/.local/lib/python3.5/site-packages/')
import time
import traceback

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

#Options for using Headless Chrome
options = webdriver.chrome.options.Options()
options.add_argument('--headless')

#Driver settings
browser = webdriver.Chrome(executable_path="/usr/lib/chromium-browser/chromedriver", chrome_options=options)

#Login information
USER = "user"
PASS = "pass"

#Show login screen
url_login = "https://job-draft.jp/sign_in"
browser.get(url_login)
time.sleep(1)
print("I visited the login page") 

#Enter your email address and password in the form
e = browser.find_element_by_id("user_email")
e.clear()
e.send_keys(USER)

e = browser.find_element_by_id("user_password")
e.clear()
e.send_keys(PASS)

time.sleep(1)

#Submit form
e.find_element_by_xpath("//*[@id=\"new_user\"]/div[4]").click()
print("You are now logged in")

#Global state used by get_user_data() and main()
user_list = []
page = ""
last = 50

def get_user_data():
    url = "https://job-draft.jp/festivals/22/users?page="
    url = url+str(page)
    browser.get(url)
    #Each page lists rows 2-11; the final page (49) has fewer
    count = 12
    if page == 49:
        count = 8
    num = 2
    while num < count:
        #Common selector prefix for the row currently being read
        row = "#page-wrapper > div.wrapper-content > div > div > div.col-xs-12.col-sm-12.col-md-8.col-lg-8 > div.ibox > div > div > div:nth-child("+str(num)+") > div > "
        try:
            user = browser.find_elements_by_css_selector(row+"div.col-xs-3 > div:nth-child(2) > a > span")
            age = browser.find_elements_by_css_selector(row+"div.col-xs-3 > div:nth-child(3) > span")
            name = browser.find_elements_by_css_selector(row+"div.col-xs-9 > div.row > div.col-xs-4.col-sm-3.col-md-3.col-lg-3 > span.f-w-bold.u-font-ml")
            max_amount = browser.find_elements_by_css_selector(row+"div.col-xs-9 > div.row > div:nth-child(2) > span.f-w-bold.u-font-ml")
            cum_avg = browser.find_elements_by_css_selector(row+"div.col-xs-9 > div.row > div:nth-child(3) > span.f-w-bold.u-font-ml")
            ambition = browser.find_elements_by_css_selector(row+"div.col-xs-9 > div.u-m-t-5 > div:nth-child(1) > span.f-w-bold.u-font-mm")
        except NoSuchElementException:
            print("There was no element")
            sys.exit(1)
        for user, age, name, max_amount, cum_avg, ambition in zip(user, age, name, max_amount, cum_avg, ambition):
            user = user.text
            age = age.text
            name = name.text
            #Strip the Japanese unit suffix '万円' (10,000 yen)
            max_amount = max_amount.text.replace('万円', '')
            cum_avg = cum_avg.text.replace('万円', '')
            ambition = ambition.text
            print(user, age, name, max_amount, cum_avg, ambition)
            record = {"user": user, "age": age, "name": name, "max_amount": max_amount, "cum_avg": cum_avg, "ambition": ambition}
            user_list.append(record)
            with open('./user_ranking.csv', 'a') as f:
                writer = csv.writer(f)
                writer.writerow(record.values())
        #Advance to the next row even when a selector matched nothing
        num += 1

def main():
    print("Start data scraping")
    global page
    global last
    try:
        if not page:
            get_user_data()
            page = 2
            time.sleep(3)
        #Loop to the last page
        while page < last:
            get_user_data()
            page += 1
            time.sleep(3)

    except Exception:
        traceback.print_exc()
        sys.exit(99)
    #Exit the driver and close all related windows
    browser.quit()
    print("Data scraping completed successfully")

#processing
if __name__ == '__main__':
    main()

Preprocessing

Read the CSV file produced by the program above into a Jupyter Notebook. If you pass no arguments, pandas treats the first line as the header, but the scraped data has no header row. Specifying header=None would auto-number the columns, so for clarity we pass explicit column names instead.

import numpy as np
import pandas as pd
#Read csv file
df = pd.read_csv("/tmp/user_ranking.csv", names=("age", "Maximum amount", "username", "Number of nominations", "ambition", "Cumulative average"))

Check the beginning of the data frame to make sure it's loaded.

df.head()

(Screenshot: output of df.head())

Since the scraped data was written out from a dict (unordered in Python 3.5), the columns are out of order. To make it easier to read, reorder the columns to match the website.

#sort
df = df.loc[:, ["username","age",  "Number of nominations", "Maximum amount", "Cumulative average", "ambition"]]

It has been rearranged to make it easier to see. Now you are ready.

(Screenshot: reordered data frame)

Data analysis

The basic idea of statistics is as follows.

- Every statistical phenomenon has a probability distribution.
- For every statistical phenomenon, instead of observing the whole population, we observe a sample and use it to infer and analyze the characteristics of the population.

Statistics

486 people participated in the 22nd draft. Of these, 320 received at least one nomination and 166 did not. As I wrote in last year's article, the average changes depending on whether the "no appointment" users are included.

Since the average annual income shown in the official bidding results appears to be calculated excluding "no appointment" users, this analysis also excludes them.
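To see why this choice matters, here is a toy illustration (the numbers below are made up, in units of 10,000 yen, with 0 standing for "no appointment"): including the zeros drags the mean down noticeably.

```python
import statistics

# Hypothetical highest offers; 0 marks a user with no nomination
offers = [650, 720, 0, 580, 0, 800, 690]

mean_all = statistics.mean(offers)                            # zeros included
mean_nominated = statistics.mean([o for o in offers if o != 0])  # zeros excluded

print(round(mean_all))        # 491
print(round(mean_nominated))  # 688
```

With two of seven users unnominated, the two averages differ by nearly 2 million yen, so the convention has to match the official one before comparing.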

Remove the non-numeric columns from the data frame so we can compute statistics.

#Number of nominations other than 0
df_nominated = df[df['Number of nominations'] != 0]
#Delete non-numeric value
df_nominated = df_nominated.drop(['age', 'username', 'ambition'], axis=1)

Make sure it is a number only.

df_nominated.head()

(Screenshot: numeric-only data frame)

Next, check the data types. You can see that the maximum amount and the cumulative average are both object.

#Check data type
df_nominated.dtypes

(Screenshot: df_nominated.dtypes output)

Since statistics cannot be computed on object columns, convert them to integers. Use int64 to match the data type of the number of nominations.

#Data type conversion
df_nominated.loc[:, "Maximum amount"] = df_nominated.loc[:, "Maximum amount"].astype(np.int64)
df_nominated.loc[:, "Cumulative average"] = df_nominated.loc[:, "Cumulative average"].astype(np.int64)

(Screenshot: dtypes after conversion)

Now that the data types are int64, check the statistics. The mean maximum amount is about 6.7 million yen with a standard deviation of about 1.2 million yen, so the bulk of the market falls roughly between 5.5 and 7.9 million yen (mean ± one standard deviation).

#View statistics
df_nominated.describe()

(Screenshot: df_nominated.describe() output)
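The 5.5-7.9 million yen band quoted above is just mean ± one standard deviation. A minimal sketch with made-up numbers (in units of 10,000 yen), using the same sample standard deviation that describe() reports:

```python
import pandas as pd

def one_sigma_band(amounts):
    """Return (mean - std, mean + std); Series.std() defaults to the sample std (ddof=1)."""
    s = pd.Series(amounts)
    return s.mean() - s.std(), s.mean() + s.std()

# Hypothetical offers, not the actual draft data
low, high = one_sigma_band([550, 620, 670, 700, 750, 830])
print(round(low), round(high))  # 589 785
```

For roughly bell-shaped data this band covers about two thirds of the observations, which is why it is a reasonable shorthand for "where the market is moving".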

Next, let's check the correlation coefficient. As you can see, there is a strong positive correlation between the maximum amount and the cumulative average.

#Show correlation coefficient
df_nominated.corr()

(Screenshot: correlation matrix)

Check the relationship of the data for each column with the scatterplot matrix. The size of the graph can be changed with figsize. The unit is inches.

#Show scatterplot matrix
%matplotlib inline
from pandas.plotting import scatter_matrix
_ = scatter_matrix(df_nominated, figsize=(8,6))

(Screenshot: scatterplot matrix)

histogram

Let's compare the maximum amount in the histogram. Extract the highest amount from the data frame.

#Sort in ascending order
df_amount = df_nominated.sort_values(by="Maximum amount")
#Extract the maximum amount from the data frame and add it to the list
amount_list = []
for a in df_amount["Maximum amount"]:
    amount_list.append(a)

(Screenshot: extracted list of maximum amounts)

You can see that there are many in the 6 million yen range.

#Display as a histogram
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
plt.hist(amount_list, rwidth=0.9, bins=40)
plt.xlabel('Maximum amount')
plt.ylabel('Number of people')
plt.show()

(Screenshot: histogram of maximum amounts)

Histogram (plural)

Check the histogram separately for each age group.

Also, prepare the data from the data frame.

import re
#Sort in ascending order
df_age = df.sort_values(by="Maximum amount")
df_age = df_age[df_age['Number of nominations'] != 0]

s10_list = []
s20_list = []
s30_list = []
s40_list = []
s50_list = []
s60_list = []

#Extract by age from the data frame and add to each list
#The site shows age groups in Japanese, e.g. '20代' for twenties
for age, amount in zip(df_age["age"], df_age["Maximum amount"]):
    if type(amount) is str:
        amount = np.int64(amount)
    if re.match('10代', age):
        s10_list.append(amount)
    elif re.match('20代', age):
        s20_list.append(amount)
    elif re.match('30代', age):
        s30_list.append(amount)
    elif re.match('40代', age):
        s40_list.append(amount)
    elif re.match('50代', age):
        s50_list.append(amount)
    elif re.match('60代', age):
        s60_list.append(amount)

(Screenshot: per-age-group lists)

Display as a histogram.

#Display as a histogram (multiple)
fig, ax = plt.subplots(figsize=(10,8))
labels = ['10', '20', '30', '40', '50', '60']
ax.hist((s10_list, s20_list, s30_list, s40_list, s50_list, s60_list), label=labels)
plt.xlabel('Maximum amount')
plt.ylabel('Number of people')
ax.legend()
plt.show()

Although the sample size is small to begin with, people in their 50s and 60s received no nominations, and, as expected, it is the 20s and 30s where the market is most active.

(Screenshot: histograms by age group)

Box plot

Check the box plot. Box plots are great when you want to find outliers.

#Displayed as a box plot
fig, ax = plt.subplots(figsize=(10,8))
labels = ['10', '20', '30', '40', '50', '60']
ax.boxplot((s10_list, s20_list, s30_list, s40_list, s50_list, s60_list), labels=labels)
plt.xlabel('age')
plt.ylabel('Maximum amount')
plt.show()

You can see that offers in the 10-million-yen class are detected as outliers. Perhaps it is in your 30s that annual income starts to get interesting: the 30s show the widest range of any age group.

The data suggests that the gap in annual income widens in one's 30s. You can also see that the minimum line for people in their 40s is 6 million yen.

(Screenshot: box plot by age group)
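matplotlib's boxplot flags outliers with Tukey's 1.5 × IQR rule. A small sketch of the same rule with hypothetical offers (in units of 10,000 yen):

```python
import numpy as np

def iqr_fences(values):
    """Tukey's rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical offers for one age group, not the actual draft data
offers = [550, 600, 620, 650, 680, 700, 1000]
low, high = iqr_fences(offers)
print([o for o in offers if o < low or o > high])  # [1000]
```

Here the fences come out to 490 and 810, so the 10-million-yen-class offer is flagged, just like the points drawn outside the whiskers in the plot above.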

Word cloud

Extract the ambitions from the scraped data and visualize them with a word cloud.

To use the word cloud, run the following commands to set up the environment (*). This article uses fonts-takao-mincho for Japanese fonts.

(*) The environment of this article is Raspberry Pi

$ pip3 install wordcloud
$ sudo apt-get install fonts-takao-mincho

Extract ambitions from the data frame and add them to the list.

#Extract ambitions from dataframes and add to list
word_list = []
for w in df["ambition"]:
    if type(w) is  float:
        w = str(w)
    word_list.append(w)

Group the elements of the list into a word object.

#Concatenate the list elements into a single string
word = ''
for w in word_list:
    word += w

#replace() returns a new string, so reassign the result
word = word.replace(',', '')

(Screenshot: concatenated ambition text)

Create a word cloud. At this time, if you want to display in Japanese, specify the font path.

from wordcloud import STOPWORDS, WordCloud
#Word cloud creation
wordcloud = WordCloud(width=1200, height=900, background_color='white', colormap='winter', font_path='/usr/share/fonts/truetype/fonts-japanese-mincho.ttf')

wordcloud.generate(word)
wordcloud.to_file('wordcloud.png')

(Screenshot: word cloud generation output)

Display the created word cloud.

#Image display
from IPython.display import Image
Image("./wordcloud.png")

(Screenshot: word cloud image)

What I saw by analyzing the data of the engineer market

What the data showed was something essential and universal that never changes: keep learning and become an engineer.

Data analysis 2

We will improve the above program a little and analyze other acquired data.

List of cumulative participating companies

We will use the data of the cumulative list of participating companies as a ranking to investigate what kind of companies are popular. Prepare the data frame by the same procedure.

The following sorts the number of nominations in descending order and displays the top 10. A high number of nominations indicates a company that is focusing on recruiting.

#Read csv file
df_com = pd.read_csv("/tmp/companies_list.csv", names=("Company name", "Shinshi degree", "name",  "Number of love calls", "Consent (acceptance rate)"))
#Sort the nominations in descending order and display the top 10
df_com.sort_values(by="name", ascending=False).head(10)

(Screenshot: top 10 companies by nominations)

(*) Sincerity is NaN because its element carries the hidden-xs.hidden-sm classes: the page apparently served this program a layout in which the value was hidden, so it could not be acquired.

The following sorts the number of love calls in descending order and displays the top 10. A high number of love calls indicates a popular company.

#Sort the number of love calls in descending order and display the top 10
df_com.sort_values(by="Number of love calls", ascending=False).head(10)

(Screenshot: top 10 companies by love calls)

Past bid results (number of participants / number of participating companies / total number of nominations)

Investigate the past bid results. Because the scraped data lists the 22nd draft first, sort the index in descending order to get chronological order.

#Read csv file
df_results = pd.read_csv("/tmp/past_bid_results.csv", names=("Central presentation annual income", "Times", "Number of participating companies", "The number of participants", "Total number of nominations", "Total annual income presented", "Average annual salary"))
#sort
df_results = df_results.loc[:, ["Times", "The number of participants", "Number of participating companies", "Total number of nominations", "Average annual salary", "Central presentation annual income", "Total annual income presented"]]
#Sort by index descending
df_results = df_results.sort_index(ascending=False)

(Screenshot: past bid results data frame)

Check the number of participants, the number of participating companies, and the total number of nominations in the past bid results in chronological order.

#Labels for the 1st-22nd drafts
x = [str(i) for i in range(1, 23)]
y = []
y2 = []
y3 = []
for i in df_results["The number of participants"]:
    y.append(i)
for i in df_results["Number of participating companies"]:
    y2.append(i)
for i in df_results["Total number of nominations"]:
    y3.append(i)

fig, ax = plt.subplots()
ax.plot(x, y, label='Number of participants')
ax.plot(x, y2, label='Number of participating companies')
ax.plot(x, y3, label='Total nominations')
ax.legend(loc='best')
plt.xlabel('time')
plt.ylabel('number')
plt.show()

(Screenshot: time series of participants, companies, and nominations)

The number of participants has remained almost flat, while the number of participating companies has been increasing little by little. The total number of nominations has decreased by 36% when calculated with the rate of change (*) relative to the 15th draft, held at the same time last year.

\text{Rate of change} = \frac{\text{value at each time} - \text{value at the reference time}}{\text{value at the reference time}} \times 100
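The formula is a one-liner in code. The figures below are illustrative only, not the actual draft totals:

```python
def rate_of_change(value, reference):
    """Percentage change of `value` relative to `reference`."""
    return (value - reference) / reference * 100

# Hypothetical totals of nominations: 15th draft as reference, 22nd as value
print(round(rate_of_change(640, 1000)))  # -36
```

A negative result means a decline relative to the reference draft, matching the -36% reading above.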

Past bid results (average annual income presented / central annual income presented)

Check the average annual income and central annual income of past bid results in chronological order.

#Labels for the 1st-22nd drafts
x = [str(i) for i in range(1, 23)]
y = []
y2 = []
for i in df_results["Average annual salary"]:
    y.append(i)
for i in df_results["Central presentation annual income"]:
    y2.append(i)

fig, ax = plt.subplots()
ax.plot(x, y, label='Average presented annual income')
ax.plot(x, y2, label='Centrally presented annual income')
ax.legend(loc='best')
plt.xlabel('time')
plt.ylabel('Amount of money')
plt.show()

(Screenshot: time series of presented annual incomes)

The average and median ("central") presented annual incomes trended upward from the 3rd through the 19th draft, but have been declining from the 20th to the 22nd, held at the end of this year.

Summary of engineer market research results

Looking at the participating companies first, there are no SIer-type firms. Most are operating companies, roughly divided into large enterprises, web-based mega-ventures, and startups. In addition, most are based in Tokyo.

Therefore, what can be inferred from the sample data is the engineer market value at operating companies.

The above is a summary of the survey results based on the assumptions.

- Many of the engineers in the Kanto area aiming for operating companies are in the 6-million-yen class.
- By age, people in their 20s and 30s are the most active: the 20s are the most numerous, and the 30s show the widest range. The minimum line for the 40s is 6 million yen. There are few samples for the 50s and 60s.
- Companies that nominate a lot seem to be focusing on hiring, but the mobility of their people also appears high (short tenures and high turnover).
- Popular companies have solid products, and many of them provide value to the world.
- The average and median presented annual incomes are expected to stay on an upward trend in recent years.

reference

According to the 2018 survey results shown in the IT Human Resources White Paper 2019, the total number of IT personnel in Japan is 1,226,000. Of these, 938,000 work at IT companies (the IT-provider side) and 288,000 at user companies (the IT-user side).

Source: White Paper on IT Human Resources 2019 Chart 1-2-8 Estimating the total number of IT human resources

See the "2018 Labor Force Survey Annual Report", a survey published by the Ministry of Internal Affairs and Communications. In 2018, the largest annual income bracket for male regular employees was 5 to 6.99 million yen at 22.8% (up 0.1 points from the previous year), followed by 3 to 3.99 million yen at 19.8% (unchanged). For men and women combined, the largest bracket is 3 to 3.99 million yen.

Source: Statistical table II Detailed tabulation "II-A-Table 3"-"Income from work (annual), number of employees by employment type"

The official statistics portal e-Stat, where Japanese government statistics can be browsed, provides wage statistics for system engineers and programmers. They are shown below.

(Screenshot: e-Stat wage statistics for system engineers and programmers)

in conclusion

We can see that the data observed in the job change draft is only a small part of the total of about 1.2 million IT personnel in Japan.

Since the term "IT human resources" is used broadly, it includes system engineers, programmers, and consultants. Most can be roughly divided between the information service industry and the Internet-related service industry within the information and communication sector, but engineers who build software embedded in products such as automobiles are classified under automobile manufacturing, so the actual size across all industries turned out to be difficult to estimate.

Looking at the Ministry of Internal Affairs and Communications' annual Labor Force Survey, where the 5 to 6.99 million yen class is the largest for male regular employees, and at the e-Stat wage statistics, the 6-million-yen market value observed in the Job Change Draft looks fairly ordinary when viewed against industry as a whole.

When your skills (what you are good at) match what society and companies want, I think the money follows. Let's make that edge work.
